2012’s DB colloquium

Below you will find a list of last year's DB colloquia, usually held on Tuesdays from 13:45h. to 14:30h. in ZI-3126.

26 January 2012 (Thursday at 16.00 – in Dutch)
Design Project Group
Quetis – a self-configuring interface between devices
Quetis is a software tool that enables communication with paralyzed patients by interactively calibrating and configuring any given set of human input devices that have a compatible middleware driver for the patient. Quetis detects the user's capabilities per device, configures itself to use only the subset of actions that the patient can actually perform, and maps these to a configuration driving a premade GUI, although it could drive any generic output system with proper middleware. In short: Quetis is a generalized and self-configuring interface between specialized input devices and the environment for paralyzed patients in ICUs.

27 February 2012
Tjitze Rienstra (University of Luxembourg)
Argumentation Theory
In the theory of abstract argumentation, the acceptance status of arguments is normally determined for the complete set of arguments at once, under a single semantics. However, this is not always desired. For example, in multi-agent systems, the provenance of the arguments and the competence of the agents often suggest different evaluation criteria for different arguments.
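
For readers new to abstract argumentation, here is a minimal sketch of one standard semantics, Dung's grounded extension, which accepts exactly the arguments that are iteratively defended against all of their attackers; the toy framework at the bottom is my own example, not from the talk.

    def attackers(a, attacks):
        # the arguments that attack a; attacks is a set of (attacker, target) pairs
        return {b for (b, t) in attacks if t == a}

    def grounded_extension(args, attacks):
        # least fixpoint of the characteristic function:
        # accept a iff every attacker of a is attacked by an accepted argument
        accepted = set()
        while True:
            defended = {a for a in args
                        if all(attackers(b, attacks) & accepted
                               for b in attackers(a, attacks))}
            if defended == accepted:
                return accepted
            accepted = defended

    # toy framework: c attacks b, b attacks a, so a is defended by c
    print(grounded_extension({"a", "b", "c"}, {("b", "a"), ("c", "b")}))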

13 March 2012
Robin Aly
AXES: Access to Audiovisual Archives
The EU Project AXES aims at opening large audio-visual archives to a variety of user groups. My main task within the project is to provide search and linking functionality. Meanwhile, the first year of the project has passed and its progress will be reviewed next week. This colloquium is a rehearsal of the presentation I will give there. I will also provide a general overview of the project and demos of existing work.

28 March 2012 (Wednesday at 15.00h.-15.45h.)
Dong Nguyen
Evaluating federated search on the Web
Federated search systems have previously been evaluated by reusing existing TREC datasets. However, these datasets do not reflect realistic search systems found on the Web. As a result, it has been difficult to assess whether these systems are suitable for federated search on the Web. We therefore introduce a new dataset containing more than a hundred actual search engines. We first discuss the design of the dataset and present several analyses. We then compare several popular resource selection methods and discuss the results. Finally, we present several suggestions and modifications to incorporate more Web-specific features.

28 March 2012 (Wednesday at 10.45h.-11.45h. in CR-1B)
Henning Rode (Textkernel)
Structured Retrieval in Practice
In this talk I will give a demo of a CV search system built for job recruiters, and describe the challenges of building such a system: user-friendly faceted search, synonym handling, and location search. When searching richly structured documents such as CVs, we also encountered a number of ranking problems using the standard language modelling approach for retrieval. The second part of the presentation will therefore discuss these issues in more detail and explain why they require field-specific solutions. Finally, I will share some ideas on how to further improve the search experience by making use of large domain knowledge sources.

3 April 2012
Maarten Fokkinga
Database Design
what you’ve always been doing but were never fully aware of
We show how to construct (in an almost algorithmic way) a query formulation for a database schema out of an (arguably simpler) query formulation in terms of an Entity-Relationship diagram. Doing so first requires a thorough understanding of the construction of a database schema out of the ER diagram. For this latter task, we show how to express the relations between the various steps in the development of the database schema, and what proof obligations exist.
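
A hypothetical worked example of the kind of translation meant here (the entity and attribute names are my own): given entity sets Student and Course and a relationship set Enrolled, the ER-level query "names of students enrolled in the course with code c" becomes, once Enrolled is mapped to its own table with foreign keys to Student and Course, the schema-level query

    \pi_{\mathit{name}}\bigl(\mathit{Student} \bowtie \mathit{Enrolled} \bowtie \sigma_{\mathit{code}=c}(\mathit{Course})\bigr)

with the proof obligation that this schema-level query returns exactly the answers of the ER-level query under the chosen schema construction.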

17 April 2012
Lesley Wevers
A functional database programming language
We explore the possibilities of using functional languages in database management. We will develop a prototype implementation and compare it to the traditional approach of a general-purpose language combined with a database management system, on aspects of performance and usability.

1 May 2012
Juan Amiguet, Rezwan Huq, and Andreas Wombacher
Data Processing – provenance and propagation
We briefly introduce two case studies we are currently working on, which Juan and Rezwan will use to evaluate their research. This talk will hopefully allow us to discuss the differences between propagation, investigated by Juan, and provenance, investigated by Rezwan.

2 May 2012 (Wednesday at 11:30h. in ZI-4126)
Dolf Trieschnigg
An Exploration of Language Identification Techniques for Dutch Folktales
The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages including (Middle and 17th century) Dutch, Frisian and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show that in comparison to typical language identification tasks, classification performance for highly similar languages with little training data is low.
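
A common baseline for this task (not necessarily one of the approaches compared in the talk) is character n-gram profiling in the style of Cavnar and Trenkle: rank a document's most frequent n-grams against per-language profiles and pick the language with the smallest "out-of-place" distance. A minimal sketch:

    from collections import Counter

    def profile(text, n=3, size=300):
        # the most frequent character n-grams, most frequent first
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        return [g for g, _ in grams.most_common(size)]

    def out_of_place(doc, lang):
        # sum of rank differences; unseen n-grams get the maximum penalty
        rank = {g: i for i, g in enumerate(lang)}
        return sum(abs(i - rank.get(g, len(lang))) for i, g in enumerate(doc))

    def identify(text, lang_profiles):  # lang_profiles: language name -> profile
        doc = profile(text)
        return min(lang_profiles, key=lambda l: out_of_place(doc, lang_profiles[l]))

The talk's finding is precisely that such profile-based classification degrades when the candidate languages are highly similar and training data is scarce, as with Dutch dialects.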

12 June 2012
Robin Aly
Statistical Shard Selection
Large search engines partition their documents into so-called shards (each shard contains the index of many documents). For a query, the search engine has to decide which shards should be used for searching (usually the top-n). To decide which shards to use, the shards are represented by sample documents which are retrieved in a first retrieval step. However, generating good document samples is not trivial and requires storage space. In this talk, I want to reflect on my ideas for the shard selection problem. The ideas are based on the fact that most current retrieval models are simply weighted sums of features, for which simple statistical laws exist: the expectation of a sum is the sum of the expectations, and the variance of a sum of independent variables is the sum of the variances. I propose to represent shards by their expected feature values, the feature variances and covariances. Using this representation, I hope one can determine the score distribution for the current query in each shard. The shards to select are those which, according to this distribution, have a fair chance of containing documents with a score higher than a certain threshold.
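
To make the statistics concrete (the normal approximation in the selection rule is my assumption, not stated above): for a retrieval score that is a weighted sum of features,

    s(d) = \sum_i w_i f_i(d), \qquad
    \mathrm{E}[s] = \sum_i w_i\,\mathrm{E}[f_i], \qquad
    \mathrm{Var}[s] = \sum_{i,j} w_i w_j\,\mathrm{Cov}(f_i, f_j),

so a shard summarized by its feature means, variances and covariances yields a per-shard mean and variance of the score for each query; one would then select the shards where, say, a normal approximation with these moments puts enough probability mass above the score threshold.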

18 June 2012 (Monday all day)
CTIT Symposium 2012
ICT: The Innovation Highway
In the EU’s Horizon 2020 programme, three objectives have been set: excellent research, competitive industries, and a better society. ICT plays a central role in reaching these goals. At this year’s CTIT symposium we will take up the challenges defined at the national and European level: challenges which, when solved, will lead us to a better future. Together with you we will show that ICT is the true innovation highway.

19 June 2012
Design Projects
Yasona: a peer-to-peer social network
Our goal was to design a peer-to-peer network structure to create a decentralized social network. The design should not focus on the social aspects of networking, or the various features a social medium could have, per se, but should instead offer a platform on which such elements can be built. Yasona was developed to give people the opportunity to communicate with each other and share media without being dependent on a central server. We will demonstrate our prototype and discuss our goals, design choices and recommendations.

Design, implementation and evaluation of the Bata App
For the Batavierenrace, an Android smartphone app has been designed and implemented which presents, in a well-organized way, all available information from the official Batavierenrace organization (such as teams, running times, standings, etc.). After some initial updates, the app has performed satisfactorily, without bugs, and was downloaded over two thousand times (reaching the top 10 of the Google market). The design, problems encountered, and experiences will be discussed during the talk.

17 July 2012
Brend Wanders
Semi-structured data in a wiki
Wikis offer free-form editing of, and collaboration on, texts. These texts are usually of an informative nature, and are intended for consumption by people. By embedding semi-structured information in a wiki, the information can also be used by other systems. In this short talk I will present my take on using a wiki as a basis for the collaborative creation and curation of data sets, by offering ad-hoc data entry and querying.
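
To illustrate what embedded semi-structured information can look like, here is a minimal sketch assuming an inline property::value annotation syntax in the style of Semantic MediaWiki (whether the talk uses this particular syntax is my assumption):

    import re

    def annotations(wikitext):
        # extract (property, value) pairs like [[born in::Enschede]]
        return re.findall(r"\[\[([^:\]]+)::([^\]]+)\]\]", wikitext)

    print(annotations("The hero [[born in::Enschede]] fights a [[type::dragon]]."))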

15 August 2012 (Wednesday at 13.30h.)
Nick Barkas (Spotify, Sweden)
Search at Spotify
Nick Barkas is a software developer at Spotify in Stockholm, Sweden, working mostly with backend/server-side systems. He studied scientific computing at KTH in Stockholm and the University of Washington in Seattle. Barkas will talk about how Spotify serves music metadata to users and how that relates to search.

11 September 2012
Mohammed Salem (Humboldt University, Berlin)
Journalistic Multimedia Data Analytics
In this project we propose to develop applications and tools around content-based journalistic data management, analysis, retrieval and visualization. New algorithms are needed for the automatic extraction of content-related metadata and annotations, not only for text documents but also for news videos, images, audio signals and animations. Moreover, new retrieval methods are needed that utilize the multimodal nature of news data and are able to return different materials related to a certain news story.

25 September 2012
Mena Habib
Toponym extraction and disambiguation
Toponym extraction and disambiguation are key topics recently addressed by the fields of Information Extraction and Geographical Information Retrieval. Toponym extraction and disambiguation are highly dependent processes: not only does extraction effectiveness affect disambiguation, but disambiguation results may also help improve extraction accuracy. In this paper we propose a hybrid toponym extraction approach based on Hidden Markov Models (HMM) and Support Vector Machines (SVM). A Hidden Markov Model is used for extraction with high recall and low precision. An SVM is then used to find false positives, based on informativeness features and coherence features derived from the disambiguation results. Experiments conducted on a set of descriptions of holiday homes, with the aim to extract and disambiguate toponyms, showed that the proposed approach outperforms state-of-the-art extraction methods and also proved to be robust. Robustness is shown on three aspects: language independence, high and low HMM threshold settings, and limited training data.
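
A sketch of the two-stage shape of this approach, with toy stand-ins for both stages: a capitalization heuristic in place of the trained HMM, and made-up feature vectors in place of the informativeness and coherence features that, in the paper, come from the disambiguation step.

    from sklearn.svm import SVC

    def extract_candidates(text):
        # stand-in for the high-recall, low-precision HMM extractor
        return [tok.strip(".,") for tok in text.split() if tok[:1].isupper()]

    # hypothetical training data: [informativeness, coherence] per candidate
    X_train = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.3, 0.2]]
    y_train = [1, 1, 0, 0]  # 1 = true toponym, 0 = false positive
    svm = SVC().fit(X_train, y_train)

    def filter_false_positives(candidates, features):
        # second stage: keep only candidates the SVM accepts as toponyms
        keep = svm.predict(features)
        return [c for c, k in zip(candidates, keep) if k == 1]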

28 September 2012 (Friday at 14.00h.)
Hans Wormer (Almere Data Capital)
Growing with Big Data
Hans Wormer is program manager of Almere Data Capital. The term Data Capital refers to the concentration of companies, services, knowledge and facilities that support the collection, storage, access, sharing, editing and visualization of big data. The program Almere Data Capital brings together supply and demand safely and efficiently. Hans Wormer will address: Almere's vision on the developments around Big Data; Almere's approach to stimulating new activities; the creation of new jobs for the city and region; and finally, how to get involved in Almere Data Capital.

1 October 2012 (Monday 13.30h. in ZI-3126)
Victor de Graaff (with an introduction from Djoerd Hiemstra)
The theory behind Scrum
Djoerd will give a 6-minute-and-40-second “pecha kucha” introduction on the plans for the module “Data & Information” of the new Computer Science bachelor.
Victor will give a 45-minute presentation on the theory behind Scrum, an increasingly popular software development methodology. Scrum is an implementation of Agile development, and is based on the team's capability to plan and review its own work.

23 October 2012
Rezwan Huq
From Scripts Towards Provenance Inference
Scientists require provenance information either to validate their model or to investigate the origin of an unexpected value. However, they often do not maintain any provenance information, and even designing the processing workflow explicitly is rare in practice. Therefore, in this paper, we propose a solution that can build the workflow provenance graph by interpreting the scripts used for the actual processing. Furthermore, scientists can request fine-grained provenance information based on the inferred workflow provenance. We also provide a guideline to customize the workflow provenance graph based on user preferences. Our evaluation shows that the proposed approach is relevant and suitable for scientists to manage provenance.
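
As a toy illustration of building a workflow graph from a script (assuming Python scripts; the approach in the paper interprets processing scripts more generally):

    import ast

    def workflow_edges(script):
        # one edge per variable read while computing another variable
        edges = []
        for node in ast.walk(ast.parse(script)):
            if isinstance(node, ast.Assign):
                targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
                reads = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
                # drop names that only occur as the function being called
                calls = {c.func.id for c in ast.walk(node.value)
                         if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)}
                edges += [(r, t) for t in targets for r in reads - calls]
        return edges

    print(workflow_edges("raw = load()\nclean = clean_up(raw)\nresult = fit(clean, params)"))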

12 November 2012 (Monday, 12.30h. in ZI-2126)
Iwe Muiser
Cleaning up and Standardizing a Folktale Corpus for Humanities Research
Recordings in the field of folk narrative have been made around the world for many decades. By digitizing and annotating these texts, they are frozen in time and become better suited for searching, sorting and research. This paper describes the first steps of the process of standardizing and preparing digital folktale metadata for scientific use, and of improving the availability of the data for humanities research and, more specifically, folktale research. The Dutch Folktale Database has been used as a case study but, since these problems are common to all corpora with manually created metadata, the explanation of the process is kept as general as possible.

14 November 2012 (Wednesday, 13.45h. in CR-3E)
Thijs Westerveld (Teezir, Utrecht)
Analysing Online Sentiments: Big Data, Small Building Blocks
The term big data has become mainstream to the point that it's showing up in lists of most annoying management buzzwords. Teezir helps its customers find value in big data beyond the hype. By collecting and analysing almost half a million documents on a daily basis and ordering, summarizing and aggregating the gathered information, we turn big data into valuable insights. To process a continuous stream of Tweets, Facebook updates, forum and blog posts, and online and offline news articles, we have developed a series of building blocks. In this talk I will discuss some of these, including our smart crawlers that learn which links to follow based on user interaction, our text analysis components that detect the language and sentiment of a document, and the index structures we use to quickly produce suitable aggregates in a faceted-search-like fashion. To conclude, I will give a demonstration of our analytics dashboards and show some examples of how our customers interact with this data and how they incorporate our technology in their daily process.

20 November 2012
Sergio Duarte
Query Recommendation for Children
In this work we propose a method that utilizes tags from social media to suggest queries related to children's topics. Concretely, we propose a simple yet effective approach to bias a random walk, defined on a bipartite graph of web resources and tags, through keywords that are more commonly used to describe resources for children. We evaluate our method using a large sample from a log of queries aimed at retrieving information for children. We show that our method outperforms the query suggestions of state-of-the-art search engines and state-of-the-art query suggestions based on random walks.
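
A toy sketch of the mechanism of biasing the tag step of such a walk (graph, tag vocabulary and bias factor are all hypothetical; the paper's exact bias and suggestion ranking differ):

    import random

    resource_tags = {  # toy bipartite graph: resource -> tags
        "r1": ["fairy tale", "animals"],
        "r2": ["animals", "games"],
        "r3": ["politics", "economy"],
    }
    tag_resources = {}
    for r, tags in resource_tags.items():
        for t in tags:
            tag_resources.setdefault(t, []).append(r)

    CHILD_TAGS = {"fairy tale", "games", "animals"}  # hypothetical bias vocabulary

    def walk(start, steps=10, bias=3.0):
        # alternate resource -> tag -> resource; tags commonly used for
        # children's content are 'bias' times more likely to be followed
        r = start
        for _ in range(steps):
            tags = resource_tags[r]
            weights = [bias if t in CHILD_TAGS else 1.0 for t in tags]
            t = random.choices(tags, weights)[0]
            r = random.choice(tag_resources[t])
        return r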

27 November 2012
Mohammad Khelghati
Size Estimation of Non-Cooperative Data Collections
In this paper, the approaches for estimating the size of non-cooperative databases and search engines are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. One of the modifications improves the estimations of the other approaches by 35 to 65 percent.
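
One classic estimator in this area (whether it is among the approaches reviewed here is my assumption) is capture-recapture: draw two random document samples from the collection and estimate its size from their overlap.

    def lincoln_petersen(sample_a, sample_b):
        # N is approximately |A| * |B| / |A intersect B|,
        # assuming both samples are drawn uniformly at random
        overlap = len(set(sample_a) & set(sample_b))
        if overlap == 0:
            raise ValueError("no overlap: samples too small for an estimate")
        return len(sample_a) * len(sample_b) / overlap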

18 December 2012
Fabian Panse (University of Hamburg)
Indeterministic Handling of Uncertain Decisions in Deduplication
In this paper, we present an indeterministic approach for deduplication by using a probabilistic target model including techniques for proper probabilistic interpretation of similarity matching results. Thus, instead of deciding for a most likely situation, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Furthermore, the deduplication process becomes almost fully automatic and human effort can be reduced to a large extent. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
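
A minimal sketch of the semi-indeterministic idea as I read the abstract (the thresholds are hypothetical): decide only the clear cases deterministically, and keep both alternatives, weighted by probability, for the uncertain ones.

    def triage(pair_probs, lo=0.2, hi=0.8):
        # pair_probs: list of (record pair, probability the pair is a duplicate)
        decided, indeterminate = [], []
        for pair, p in pair_probs:
            if p >= hi:
                decided.append((pair, "duplicate"))
            elif p <= lo:
                decided.append((pair, "distinct"))
            else:
                # keep both possible worlds, weighted by their probabilities
                indeterminate.append((pair, {"duplicate": p, "distinct": 1 - p}))
        return decided, indeterminate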

See also: 2011's DB colloquium.