Matthijs Ooms graduates on Provenance Management for Bioinformatics

by Matthijs Ooms

Scientific Workflow Management Systems (SWfMSs), such as our own research prototype e-BioFlow, are used by bioinformaticians to design and run data-intensive experiments, connecting local and remote (Web) services and tools. Preserving data for later inspection or reuse determines the quality of the results, and the ability to validate results is essential for scientific experiments. Both can be achieved by collecting provenance data. The dependencies between services and data are captured in a provenance model, such as the interchangeable Open Provenance Model (OPM). This research has the following two provenance-related goals:

  1. Using a provenance archive effectively and efficiently as cache for workflow tasks.
  2. Designing techniques to support browsing and navigation through a provenance archive.

The use case identified, called OligoRAP, is taken from the life science domain. OligoRAP was cast as a workflow in the SWfMS e-BioFlow. Its performance in terms of duration was measured, and its results were validated by comparing them to the results of the original Perl implementation. By casting OligoRAP as a workflow and exploiting parallelism, its performance improved by a factor of two.

Many improvements were made to e-BioFlow in order to run OligoRAP, among which a new provenance implementation based on the OPM, enabling provenance capturing during the execution of OligoRAP in e-BioFlow. During this research, e-BioFlow has grown from a proof of concept to a powerful research prototype. For the OPM implementation, a profile for the OPM has been proposed that defines how provenance is collected during workflow enactment. The proposed profile maintains the hierarchical structure of (sub)workflows in the collected provenance data, improving the interoperability of the OPM for SWfMSs. A caching strategy for workflow tasks is proposed and implemented in e-BioFlow: it queries the OPM implementation for previous task executions. The queries were optimised by reformulating them and by creating several indices; the performance improvement of each optimisation was measured using a query set taken from an OligoRAP cache run. Three tasks in OligoRAP were cached, resulting in an additional performance improvement of 19%. This shows that a provenance archive based on the OPM can be used effectively to cache workflow tasks. Finally, a provenance browser is introduced that incorporates several techniques for browsing through large provenance archives; its primary visualisation is the graph representation specified by the OPM.
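The caching idea can be sketched in a few lines of Python. This is a toy illustration, not e-BioFlow's actual OPM implementation, and all names (the class, the `blast` stand-in task) are hypothetical: before executing a task, the archive is consulted for a previous execution with identical inputs, and the recorded output is reused on a hit.

```python
import hashlib
import json

class ProvenanceCache:
    """Toy provenance archive used as a cache for workflow tasks:
    each task execution is recorded under the task name plus a hash of
    its input artifacts, and a later execution with identical inputs
    reuses the recorded output instead of re-running the task."""

    def __init__(self):
        self._archive = {}  # (task name, input hash) -> recorded output

    @staticmethod
    def _hash_inputs(inputs):
        blob = json.dumps(inputs, sort_keys=True).encode("utf-8")
        return hashlib.sha1(blob).hexdigest()

    def run(self, task_name, task_fn, inputs):
        key = (task_name, self._hash_inputs(inputs))
        if key in self._archive:        # cache hit: reuse past execution
            return self._archive[key]
        output = task_fn(**inputs)      # cache miss: execute the task
        self._archive[key] = output     # record the execution
        return output

executions = []
def blast(sequence):                    # stand-in for an expensive service call
    executions.append(sequence)
    return "hits-for-" + sequence

cache = ProvenanceCache()
first = cache.run("blast", blast, {"sequence": "ACGT"})
again = cache.run("blast", blast, {"sequence": "ACGT"})
# the task ran only once; the second call was answered from the archive
```

In the real system the lookup is a query against the OPM provenance store rather than an in-memory dictionary, which is why query optimisation and indexing mattered for the 19% improvement reported above.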

More information at the e-BioFlow project page at SourceForge, or in Matthijs' master's thesis in ePrints.

Annual SIKS day

On November 16, 2009, the School for Information and Knowledge Systems (SIKS) organizes its annual SIKS-day. The location will be Grand Hotel Karel V in Utrecht.

The main aim of the event is to give SIKS-members, participating in research groups all over the country, the opportunity to meet each other in an informal setting and to inform them about current developments and some new activities and plans for the coming year. This year we also celebrate the fact that our school has been re-accredited by KNAW this summer for another period of six years. A small scientific symposium will be organized at the SIKS-day as well.


10.00 – 10.30 Coffee and Tea
10.30 – 10.45 Welcome by Hans Akkermans (VU), Chair Board of Governors SIKS
10.45 – 11.30 “Self-organizing access and storage of correlated data: crossing the frontier of structured data management” by Peter Boncz (CWI)
11.30 – 12.15 “Knowledge Representation on the Web: what to do when success is becoming a problem?” by Frank van Harmelen (VU)
12.15 – 12.30 “SIKS: Accreditation Outcome and Future Plans” by Roel Wieringa (UT), Scientific Director SIKS
12.30 – 13.45 Lunch
13.45 – 14.00 “The SIKS PhD Advisory Board: Going where the wild things are” by Nieske Vergunst (UU), Chair PhD Advisory Board
14.00 – 14.45 “The Minimum Description Length Principle for Data Mining” by Arno Siebes (UU)
14.45 – 15.15 Coffee and Tea
15.15 – 16.00 “On the Usability of Business Process Models: A Formal and Empirical Approach” by Hajo Reijers (TUE)
16.00 – 16.45 “Professor Kripke, let me introduce Professor Nash: Logic for Economic Mechanism Design” by Michael Wooldridge (Liverpool, UK)
16.45 – 18.00 Drinks

Relying on Topic Subsets for System Ranking Estimation

by Claudia Hauff, Djoerd Hiemstra, Franciska de Jong, and Leif Azzopardi

Ranking a number of retrieval systems according to their retrieval effectiveness without relying on costly relevance judgments was first explored by Soboroff et al. [6]. Over the years, a number of alternative approaches have been proposed. We perform a comprehensive analysis of system ranking estimation approaches on a wide variety of TREC test collections and topic sets. Our analysis reveals that the performance of such approaches is highly dependent upon the topic or topic subset used for estimation. We hypothesize that the performance of system ranking estimation approaches can be improved by selecting the “right” subset of topics, and show that using topic subsets improves the performance by 32% on average, with a maximum improvement of up to 70% in some cases.
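The starting point of Soboroff et al., ranking systems without human judgments, can be sketched as follows. This is a simplified illustration with made-up runs, not the paper's experimental setup: pooled documents are randomly marked relevant, systems are scored against these pseudo-judgments, and the estimated ranking can then be compared to an official one with a rank correlation such as Kendall's tau.

```python
import random
from itertools import combinations

def pseudo_qrels(pool, sample_rate=0.3, seed=42):
    """Soboroff-style pseudo judgments: randomly mark a fraction of
    the pooled documents as relevant, with no human assessor involved."""
    rng = random.Random(seed)
    return {doc for doc in sorted(pool) if rng.random() < sample_rate}

def average_precision(ranking, relevant):
    """Standard AP of one ranked list against a judgment set."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def kendall_tau(order_a, order_b):
    """Rank correlation between two orderings of the same systems."""
    pos_a = {s: i for i, s in enumerate(order_a)}
    pos_b = {s: i for i, s in enumerate(order_b)}
    concordant = discordant = 0
    for s, t in combinations(order_a, 2):
        if (pos_a[s] - pos_a[t]) * (pos_b[s] - pos_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0

# Three hypothetical systems ranking the same small document pool.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d2", "d4", "d1", "d3"],
    "sysC": ["d4", "d3", "d2", "d1"],
}
pool = {doc for run in runs.values() for doc in run}
judged = pseudo_qrels(pool)
estimated = sorted(runs, key=lambda s: average_precision(runs[s], judged),
                   reverse=True)
```

The paper's observation is that how well such an estimated ranking correlates with the official one depends strongly on which topics the pseudo-judgments are computed over, which motivates selecting topic subsets.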

The paper will be presented at the 18th ACM Conference on Information and Knowledge Management (CIKM 2009) in Hong Kong, China.

[download pdf]

DIR 2010 will be in Nijmegen

The Dutch-Belgian Information Retrieval Workshop (DIR 2010) will take place at the Radboud University Nijmegen, the Netherlands, on January 25-26, 2010. The primary aim of DIR is to provide an international meeting place where researchers from the domain of information retrieval and related disciplines can exchange information and present innovative research developments.

Important dates
Paper submission deadline: November 13, 2009
Notification of acceptance: December 11, 2009
Conference at Radboud University Nijmegen: January 25-26, 2010

Call for papers at DIR 2010.

Information Retrieval Models Tutorial

Many applications that handle information on the internet would be completely inadequate without the support of information retrieval technology. How would we find information on the world wide web if there were no web search engines? How would we manage our email without spam filtering? Much of the development of information retrieval technology, such as web search engines and spam filters, requires a combination of experimentation and theory. Experimentation and rigorous empirical testing are needed to keep up with increasing volumes of web pages and emails. Furthermore, experimentation and constant adaptation of technology is needed in practice to counteract the efforts of people who deliberately try to manipulate the technology, such as email spammers. However, if experimentation is not guided by theory, engineering becomes trial and error. New problems and challenges for information retrieval come up constantly. They cannot possibly be solved by trial and error alone. So, what is the theory of information retrieval? There is not one convincing answer to this question. There are many theories, here called formal models, and each model is helpful for the development of some information retrieval tools, but not so helpful for the development of others. In order to understand information retrieval, it is essential to learn about these retrieval models. In this chapter, some of the most important retrieval models are gathered and explained in a tutorial style.

The tutorial will be published in Ayse Goker and John Davies (eds.), Information Retrieval: Searching in the 21st Century, Wiley, 2009.

[download draft]

[download exercise solutions]

Distributed data processing using MapReduce

Distributed data processing using MapReduce is a new course that teaches how to carry out large-scale distributed data analysis using Google's MapReduce as the programming abstraction. MapReduce is a programming abstraction inspired by the functions 'map' and 'reduce' as found in functional programming languages such as Lisp. It was developed at Google as a mechanism to allow large-scale distributed processing of data on data centers consisting of thousands of low-cost machines. MapReduce allows programmers to distribute their programs over many machines without the need to worry about system failures, threads, locks, semaphores, and other concepts from concurrent and distributed programming. Students will learn to specify algorithms using map and reduce steps and to implement these algorithms using Hadoop, an open source implementation of Google's file system and MapReduce. The course will introduce recent attempts to develop high-level languages for simplified relational data processing on top of Hadoop, such as Yahoo's Pig Latin and Microsoft's DryadLINQ.

The course consists of lectures and practical assignments. Students will solve lab exercises on a large cluster of machines in order to get hands-on experience and solve real large-scale problems. The lab exercises will be done on the University of Twente PRISMA-2 computer, a data center consisting of 16 dual core systems sponsored by Yahoo Research. Examples of lab exercises are: counting bigrams in large web crawls, inverted index construction, and the computation of Google's PageRank. After successful completion of the course, the student is able to:

  • Dissect complex problems into algorithms that use map and reduce steps,
  • Specify these algorithms in a functional language such as Haskell,
  • Implement these algorithms using the Hadoop framework,
  • Specify simplified relational queries using Pig Latin.
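As a taste of the programming model, the bigram-counting exercise can be sketched in plain Python. This is a toy single-machine simulation of the map, shuffle, and reduce phases; on the cluster the same mapper and reducer logic would be written against the Hadoop API instead.

```python
from itertools import groupby
from operator import itemgetter

def map_bigrams(doc_id, text):
    """Map phase: emit (bigram, 1) for every adjacent word pair."""
    words = text.lower().split()
    for pair in zip(words, words[1:]):
        yield (pair, 1)

def reduce_counts(bigram, counts):
    """Reduce phase: sum the partial counts for one bigram."""
    yield (bigram, sum(counts))

def run_mapreduce(docs, mapper, reducer):
    # Shuffle step: sort and group all intermediate pairs by key,
    # as the framework would do between the map and reduce phases.
    intermediate = []
    for doc_id, text in docs.items():
        intermediate.extend(mapper(doc_id, text))
    intermediate.sort(key=itemgetter(0))
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reducer(key, (count for _, count in group)):
            results[k] = v
    return results

docs = {"d1": "to be or not to be", "d2": "to be is to do"}
counts = run_mapreduce(docs, map_bigrams, reduce_counts)
# counts[("to", "be")] == 3: twice in d1 and once in d2
```

The point of the abstraction is that the mapper and reducer are pure functions over key-value pairs, so the framework is free to run them on thousands of machines and re-execute them after failures.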

More information at Blackboard.

Russian Summer School in Information Retrieval

I will give several lectures on information retrieval modeling at the Russian Summer School in Information Retrieval, which will be held September 11-16, 2009 in Petrozavodsk, Russia. The main audience of the school is graduate and post-graduate students, young scientists and professionals who have experience in development of information retrieval applications. The school will host approximately 100 participants.

Information Retrieval Modeling
There is no such thing as a dominating model or theory of information retrieval, unlike the situation in for instance the area of databases, where the relational model is the dominating database model. In information retrieval, some models work for some applications, whereas others work for other applications. For instance, vector space models are well-suited for similarity search and relevance feedback in many (also non-textual) situations if a good weighting function is available; the probabilistic retrieval model or naive Bayes model might be a good choice if examples of relevant and nonrelevant documents are available; Google's PageRank model is often used in situations that need modelling of more or less static relations between documents; region models have been designed to search in structured text; and language models are helpful in situations that require models of language similarity or document priors. In this tutorial, I carefully describe all these models by explaining the consequences of modelling assumptions. I address approaches based on statistical language models in great depth. After the course, students are able to choose a model of information retrieval that is adequate in new situations, and to apply the model in practice.
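As a small illustration of one of these models, a query-likelihood language model with linear interpolation (Jelinek-Mercer) smoothing can be written in a few lines of Python. This is a toy sketch with a made-up two-document collection, not code from the lectures: each document is scored by the probability that its (smoothed) language model generates the query.

```python
from collections import Counter

def lm_score(query, doc, collection, lam=0.5):
    """Query-likelihood language model with linear interpolation
    smoothing: P(q|d) is the product over query terms t of
    lam * P(t|d) + (1 - lam) * P(t|C)."""
    doc_tf = Counter(doc)
    col_tf = Counter(collection)
    score = 1.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)          # term frequency in the document
        p_col = col_tf[term] / len(collection)   # background collection model
        score *= lam * p_doc + (1 - lam) * p_col
    return score

docs = [["information", "retrieval", "models"],
        ["database", "models"]]
collection = [term for doc in docs for term in doc]
query = ["retrieval", "models"]
ranked = sorted(docs, key=lambda d: lm_score(query, d, collection),
                reverse=True)
# the first document ranks higher: it contains both query terms
```

The smoothing term is what keeps a document from scoring zero when it misses one query term, and the choice of the interpolation parameter lam is one of the modelling assumptions discussed in the lectures.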

More information at RuSSIR 2009.

Concept Detectors: How Good is Good Enough?

A Monte Carlo Simulation Based Approach

by Robin Aly and Djoerd Hiemstra

Today, semantic concept based video retrieval systems often show insufficient performance for real-life applications. A big share of the reason is the lacking performance of the detectors for these concepts. While concept detectors continue to improve, the following important questions need to be addressed: “How good do detectors need to be to produce usable search systems?” and “How does the detector performance influence different concept combination methods?”. We use Monte Carlo simulations to provide answers to these questions. The main contribution of this paper is a probabilistic model of detectors that output confidence scores indicating the likelihood that a concept occurs. This score is also converted into a posterior probability and a binary classification. We investigate the influence of changes to the model’s parameters on the performance of multiple concept combination methods. Current web search engines produce a mean average precision (MAP) of around 0.20. Our simulation reveals that the best performing video search methods achieve this performance using detectors with 0.60 MAP, and are therefore usable in real life. Furthermore, perfect detection allows the best performing combination method to produce 0.39 search MAP in an artificial environment with oracle settings. We also find that MAP is not necessarily a good evaluation measure for concept detectors, since it is not always correlated with search performance.
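The flavour of such a simulation can be sketched in Python. This is a simplified illustration, not the paper's actual detector model: confidence scores are drawn from two Gaussians, one for shots that contain the concept and one for the rest, and the detector's average precision is computed from the simulated ranking. A larger gap between the two means plays the role of a better detector.

```python
import random

def simulate_detector(n_shots, prior, mu_pos, mu_neg, sigma, seed=0):
    """Draw ground-truth concept labels and detector confidence scores:
    shots containing the concept score N(mu_pos, sigma), the rest
    N(mu_neg, sigma)."""
    rng = random.Random(seed)
    shots = []
    for _ in range(n_shots):
        present = rng.random() < prior
        mu = mu_pos if present else mu_neg
        shots.append((rng.gauss(mu, sigma), present))
    return shots

def average_precision(shots):
    """AP of ranking the shots by detector confidence score."""
    ranked = sorted(shots, key=lambda shot: shot[0], reverse=True)
    n_rel = sum(present for _, present in ranked)
    hits, ap = 0, 0.0
    for rank, (_, present) in enumerate(ranked, start=1):
        if present:
            hits += 1
            ap += hits / rank
    return ap / n_rel if n_rel else 0.0

# Same noise and labels, but a wider separation between the score
# distributions: a stand-in for a better-trained detector.
weak = average_precision(simulate_detector(2000, 0.1, 0.6, 0.4, 0.3))
strong = average_precision(simulate_detector(2000, 0.1, 1.2, 0.0, 0.3))
# the larger score separation yields a higher detector AP
```

Repeating such runs while sweeping the model parameters is what lets the paper relate detector MAP to the MAP of the search methods built on top of the detectors.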

The paper will be presented at ACM Multimedia in Beijing, China.

[download pdf]