SIKS/Twente Seminar on Biomedical Text Mining

On 1 September 2010, we organize a small seminar on Biomedical Text Mining at the University of Twente. Invited speakers are:

  • Martijn Schuemie (Erasmus MC/LUMC, Rotterdam, Netherlands)
  • Dietrich Rebholz-Schuhmann (European Bioinformatics Institute, UK)

The workshop will take place on the campus of the University of Twente, in the small lecture hall of the Vrijhof (building 47). The event is sponsored by the Netherlands research School for Information and Knowledge Systems (SIKS) and the Centre for Telematics and Information Technology (CTIT). Please visit the SSR-5 home page for more information.

DBDBD 2010 in Hasselt, Belgium

This year, the Dutch Belgian Database Day (DBDBD) will be held in Hasselt, Belgium, on Monday, November 22nd, 2010. DBDBD is a yearly one-day workshop on database research, organized at a Belgian or Dutch university. DBDBD invites submissions (1-page abstracts) on a broad range of database and database-related topics, including but not limited to data storage and management, theoretical database issues, database performance, data mining, information retrieval, data semantics, querying, and ontologies. Based on the submissions, the workshop will be organized in different sessions, each covering a particular topic.

At the DBDBD, junior researchers from the Netherlands and Belgium can present their recent results. It is an excellent opportunity to meet up with your Belgian/Dutch colleagues, and to get informed about the (recent) database-related research performed at Belgian/Dutch universities. The workshop is also open to non-Belgian/Dutch participants (presentations are in English). Participation is free for all SIKS members (PhD students, research fellows, senior research fellows and associated members).

DBDBD 2010 web site

Robin Aly defends PhD thesis on uncertainty in concept-based multimedia retrieval

by Robin Aly

This thesis considers concept-based multimedia retrieval, where documents are represented by the occurrence of concepts (also referred to as semantic concepts or high-level features). A concept can be thought of as a kind of label, which is attached to (parts of) the multimedia documents in which it occurs. Since concept-based document representations are user, language and modality independent, using them for retrieval has great potential for improving search performance. As collections quickly grow both in volume and size, manually labeling concept occurrences becomes infeasible, and so-called concept detectors are used to decide automatically whether concepts occur in the documents.

The following fundamental problems in concept-based retrieval are identified and addressed in this thesis. First, the concept detectors frequently make mistakes while detecting concepts. Second, it is difficult for users to formulate their queries since they are unfamiliar with the concept vocabulary, and setting weights for each concept requires knowledge of the collection. Third, for supporting retrieval of longer video segments, single concept occurrences are not sufficient to differentiate relevant from non-relevant documents and some notion of the importance of a concept in a segment is needed. Finally, since current detection techniques lack performance, it is important to be able to predict what search performance retrieval engines would yield if detection performance improved.

The main contribution of this thesis is the uncertain document representation ranking framework (URR). Based on the Nobel Prize-winning Portfolio Selection Theory, the URR framework considers the distribution over all possible concept-based document representations of a document given the observed confidence scores of concept detectors. For a given score function, documents are ranked by the expected score plus an additional term based on the variance of the score, which represents the risk attitude of the system.
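
The ranking rule described above can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the thesis: each concept occurrence is treated as an independent Bernoulli variable whose probability equals the detector's confidence score, the score function is a simple weighted sum of concept occurrences, and the risk parameter `b` and all function names are hypothetical.

```python
def urr_score(weights, probs, b=0.0):
    """Expected score plus b times the score variance.

    Assumption: score = sum_i w_i * C_i with independent C_i ~ Bernoulli(p_i),
    so E[score] = sum w_i p_i and Var[score] = sum w_i^2 p_i (1 - p_i).
    """
    expected = sum(w * p for w, p in zip(weights, probs))
    variance = sum(w * w * p * (1.0 - p) for w, p in zip(weights, probs))
    return expected + b * variance


def rank(docs, weights, b=0.0):
    """Order document ids by decreasing URR-style score.

    docs maps a document id to its list of detector confidence scores.
    """
    return sorted(docs, key=lambda d: urr_score(weights, docs[d], b), reverse=True)
```

In this sketch a negative `b` penalizes documents whose representation is uncertain (a risk-averse system), while a positive `b` rewards them (a risk-seeking system).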

User-friendly concept selection is achieved by re-using an annotated development collection. Each video shot of the development collection is transformed into a textual description which yields a collection of textual descriptions. This collection is then searched for a textual query which does not require the user's knowledge of the concept vocabulary. The ranking of the textual descriptions and the knowledge of the concept occurrences in the development collection allows a selection of useful concepts together with their weights.

The URR framework and the proposed concept selection method are used to derive a shot and a video segment retrieval framework. For shot retrieval, the probabilistic ranking framework for unobservable events is proposed. The framework re-uses the well-known probability of relevance score function from text retrieval. Because of the representation uncertainty, documents are ranked by their expected retrieval score given the confidence scores from the concept detectors.

For video segment retrieval, the uncertain concept language model is proposed for retrieving news items — a particular video segment type. A news item is modeled as a series of shots and represented by the frequency of each selected concept. Using the parallel between concept frequencies and term frequencies, a concept language model score function is derived from the language modelling framework. The concept language model score function is then used according to the URR framework and documents are ranked by the expected concept language score plus an additional term of the score's variance.

The Monte Carlo Simulation method is used to predict the behavior of current retrieval models under improved concept detector performance. First, a probabilistic model of concept detector output is defined as two Gaussian distributions, one for the shots in which the concept occurs and one for the shots in which it does not. Randomly generating concept detector scores for a collection with known concept occurrences and executing a search on the generated output estimates the expected search performance given the model's parameters. By modifying the model parameters, the detector performance can be improved and the future search performance can be predicted.
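
The simulation idea above can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's experimental setup: it handles a single concept, ranks shots directly by the simulated detector score instead of running a full retrieval model, and uses hypothetical function and parameter names.

```python
import random


def average_precision(ranked_labels):
    """Average precision of a ranked list of booleans (True = concept occurs)."""
    hits, total = 0, 0.0
    for position, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def simulate_ap(labels, mu_pos, sigma_pos, mu_neg, sigma_neg, runs=100, seed=0):
    """Estimate expected average precision for a two-Gaussian detector model.

    Scores of shots with the concept are drawn from N(mu_pos, sigma_pos),
    all other shots from N(mu_neg, sigma_neg); the result averages over runs.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        scores = [
            rng.gauss(mu_pos, sigma_pos) if label else rng.gauss(mu_neg, sigma_neg)
            for label in labels
        ]
        order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
        results.append(average_precision([labels[i] for i in order]))
    return sum(results) / len(results)
```

Moving the two means further apart, or shrinking the standard deviations, corresponds to a better detector, so re-running `simulate_ap` with modified parameters gives a rough picture of how ranking quality would change.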

Experiments on several collections of the TRECVid evaluation benchmark showed that the URR framework often significantly improves search performance compared to several state-of-the-art baselines. The simulation of concept detectors indicates that today's video shot retrieval models will show acceptable performance once detector performance is around 0.60 mean average precision. The simulation of video segment retrieval suggests that this task is easier and will sooner be applicable to real-life applications.

[download pdf]

Bertold van Voorst graduates on collection selection using database clustering

Cluster-based collection selection in uncooperative distributed information retrieval

by Bertold van Voorst

The focus of this research is collection selection for distributed information retrieval. The collection descriptions that are necessary for selecting the most relevant collections are often created from information gathered by random sampling. Collection selection based on an incomplete index constructed by using random sampling instead of a full index leads to inferior results.

In this research we propose to use collection clustering to compensate for the incompleteness of the indexes. When collection clustering is used, we select not only the collections that are considered relevant based on their collection descriptions, but also collections that have similar content in their indexes. Most existing clustering algorithms require the specification of the number of clusters prior to execution. We describe a new clustering algorithm that allows us to specify the sizes of the produced clusters instead of the number of clusters.
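
The selection step described above, in which cluster-mates of the top-ranked collections are added to the selection, can be sketched as follows. The sketch takes the cluster assignment as given and does not implement the clustering algorithm from the thesis; the function name, the `n` cut-off and the data layout are assumptions made for illustration.

```python
def select_with_clusters(ranked, cluster_of, n):
    """Select the top-n collections plus every collection sharing a cluster.

    ranked     -- collection ids ordered by their collection-description score
    cluster_of -- dict mapping each collection id to a cluster id (assumed given)
    n          -- number of collections taken directly from the ranking
    """
    selected = list(ranked[:n])
    chosen_clusters = {cluster_of[c] for c in selected}
    for collection, cluster in cluster_of.items():
        if cluster in chosen_clusters and collection not in selected:
            selected.append(collection)
    return selected
```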

Our experiments show that collection clustering can indeed improve the performance of distributed information retrieval systems that use random sampling. There is not much difference in retrieval performance between our clustering algorithm and the well-known k-means algorithm. We suggest using the algorithm we proposed because it is more scalable.

[download pdf]

SIGIR 2010 best papers

Ryen White and Jeff Huang received the best paper award at SIGIR 2010 for their paper “Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs”. They present a log-based study estimating the user value of trail following. They demonstrate significant value to users in following trails, especially for certain query types. The findings have implications for the design of search systems, including trail recommendation systems that display trails on search result pages.

The best student paper was written by Ioannis Arapakis, Konstantinos Athanasakos, and Joemon Jose: “A comparison of general vs. personalized affective models for the prediction of topical relevance”. They determined whether the behavioural differences of users have an impact on the models' ability to determine topical relevance, and whether accuracy can be improved by personalising the models.

Keith van Rijsbergen retired

Keith van Rijsbergen is retiring this year. To celebrate his long successful career, you can download his book “Information Retrieval” in the popular epub format, an open format that is supported by most e-readers.


Since the publication in 1976 of the first edition of Van Rijsbergen's book, it has established itself as a classic. The book gives a thorough introduction to “automatic ranked” retrieval, which today forms the basis of web search engines, but at that time was still highly experimental. The book covers all important information retrieval topics, but it is Van Rijsbergen's personal view on information retrieval that makes the book so different from other scientific books on information retrieval: The book is written in the first person, a writing style I would normally not recommend for scientific documents. In this book, however, Van Rijsbergen's personal style of writing inspired me a lot. Maybe it is his undisputed expertise, maybe it is his critical analysis of the work of others, or maybe it is merely his enthusiastic account of science. Whatever it is, it is a pleasure to read the book, even almost 35 years after its first publication. Here is a nice example, where Van Rijsbergen shares his view on significance tests:

Unfortunately, I have to agree with the findings of the Comparative Systems Laboratory in 1968, that there are no known statistical tests applicable to IR. This may sound like a counsel of defeat but let me hasten to add that it is possible to select a test which violates only a few of the assumptions it makes.

His analysis led me to use the paired sign test in my PhD thesis, and I motivated this by adding that Van Rijsbergen says I am allowed to do so. (Actually, he claims I am allowed to do so only conservatively, because some of the test's assumptions are not met…) The book is also a no-nonsense book in many respects, with many practical approaches that are directly applicable. In several of our experiments, we used the stop word list printed in the book (see Table 2.1). This is science in its best form. Experiments should be easily reproducible, and what is easier than using an officially published stop word list?
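
For readers who have not used it, the paired sign test mentioned above can be sketched in a few lines. This is a generic two-sided sign test, not code from the book or from my thesis; the function name and the convention of dropping ties are assumptions.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Two-sided paired sign test p-value for per-query scores of two systems.

    Ties are dropped (an assumption); under the null hypothesis each remaining
    query is equally likely to favour either system.
    """
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The test only looks at which system wins on each query, not by how much, which is why it makes so few assumptions about the distribution of the scores.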

So, if you are still looking for a good, personal, entertaining, no-nonsense, scientific book on information retrieval to be read by the pool during the holidays, please consider Information Retrieval. No e-reader yet? Then you can read the ebook using the EPUBReader Firefox addon.

[download epub]

Let’s quickly test this on 12TB of data

MapReduce for Information Retrieval Evaluation

by Djoerd Hiemstra and Claudia Hauff

We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at:

The paper will be presented at the CLEF 2010 Conference on Multilingual and Multimodal Information Access Evaluation on 20-23 September 2010 in Padua, Italy.
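
The sequential scanning idea can be sketched as a map step that scores every document against every query and a reduce step that keeps the best documents per query. This is an in-memory illustration only, not the released code: the term-frequency scoring function, the function names and the data layout are all assumptions, and a real run would execute the two steps on a MapReduce cluster.

```python
import heapq
from collections import defaultdict


def map_doc(doc_id, text, queries):
    """Map step: emit (query id, (score, doc id)) for each matching query.

    Scoring is a plain sum of query term frequencies (a placeholder assumption).
    """
    terms = text.lower().split()
    for qid, query in queries.items():
        score = sum(terms.count(t) for t in query.lower().split())
        if score > 0:
            yield qid, (score, doc_id)


def reduce_query(qid, scored_docs, k=10):
    """Reduce step: keep the k highest-scoring documents for one query."""
    return qid, heapq.nlargest(k, scored_docs)


def run(docs, queries, k=10):
    """Simulate the job locally: scan all documents, group by query, reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for qid, pair in map_doc(doc_id, text, queries):
            grouped[qid].append(pair)
    return dict(reduce_query(qid, pairs, k) for qid, pairs in grouped.items())
```

Trying a new retrieval approach then amounts to swapping the scoring line in `map_doc`; no index has to be rebuilt.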

Guest lecture by Alexander Hauptmann at SSR-4

The 4th SIKS/Twente Seminar on Searching and Ranking will take place on 2 July at the University of Twente. The goal of the one-day seminar is to bring together researchers from companies and academia working on the effectiveness of search engines. Invited speakers are:

  • Alexander Hauptmann (Carnegie Mellon University, USA)
  • Arjen de Vries (CWI and University of Delft, Netherlands)
  • Wessel Kraaij (TNO and Radboud University Nijmegen, Netherlands)

The workshop will take place on the campus of the University of Twente at the Citadel (building 9), lecture hall T300. SSR is sponsored by SIKS and CTIT.

More information at SSR-4.

Expertise centre for cloud computing

Enschede will open an expertise centre for cloud computing on Thursday 17 June. The Centre 4 Cloud Computing will support open innovation and the sharing of knowledge on cloud computing. Cloud computing is an Internet-based computing paradigm, whereby shared resources, software and information are provided on-demand in a highly scalable way.

The expertise centre offers companies and organisations the following:

  1. Knowledge Exchange: To make (applied) knowledge and best practices available to professionals, management and other interested parties
  2. Research: Scientific applied research into technical, security, legal, and business aspects of cloud computing
  3. Commercial: Contribute to business development for companies that offer services based on cloud computing solutions

For more information, see

Tangible Information Retrieval for Children

by Michel Jansen, Wim Bos, Paul van der Vet, Theo Huibers and Djoerd Hiemstra

Despite several efforts to make search engines more child-friendly, children still have trouble using systems that require keyboard input. We present TeddIR: a system using a tangible interface that allows children to search for books by placing tangible figurines and books they like/dislike in a green/red box, causing relevant results to be shown on a display. This way, issues with spelling and query formulation are avoided. A fully functional prototype was built and evaluated with children aged 6-8 at a primary school. The children understood TeddIR to a large extent and enjoyed the playful interaction.

TeddIR in the set-up used during evaluation.

TeddIR will be presented at the 9th International Conference on Interaction Design and Children, Barcelona, June 9-11, 2010.

[download pdf]