Remko Nolten graduates on automatic hyperlinking

WikiLink: Anchor Detection and Link Generation in Wiki’s

by Remko Nolten

In this research we try to automate the process of link generation in wikis by looking at existing link generation techniques and enhancing these with our own ideas. We started the research by analyzing a large document corpus to find out more about the links we want to create. In our analysis we looked into three aspects of our datasets. First, we wanted to know more about the relation between the text that is used to display the link and the title of the page where the link points to. We showed that a large majority of the links could theoretically be identified by matching the text of the link with the page title of the appropriate page, but we also identified several problems with this approach. Second, we wanted to learn more about the existing link structure in our dataset. Here, we confirmed most of the advantages and disadvantages of using existing links in a link generation algorithm that were also identified by other studies. Finally, we analyzed the grammatical structure of links, to see if we could exploit it later on. Our analysis showed that a very large majority of the links were nouns or noun phrases, which suggests that this would be a good way to identify links in a text.

Based on the results of this analysis, we built a framework in which we could implement new and existing methods for link generation. In the framework, the process of ‘anchor detection’ (the technique of discovering phrases in a larger text that could be used as a basis for a link) and ‘destination finding’ (the process of finding a suitable destination page for a short piece of text) were separated. This way we could try multiple combinations to see which would work best. Using this framework, we found that our grammar-based anchor detection algorithm combined with multiple destination finding algorithms resulted in the best performance. Our final performance figures were better than those of most competitors, which shows the potential of our techniques.
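To make the two-stage setup concrete, here is a minimal Python sketch of such a pipeline. It is not the actual WikiLink code: the noun-phrase list stands in for a grammar-based chunker, and the title matching is deliberately crude.

```python
# Minimal sketch of the two-stage pipeline described above (not the actual
# WikiLink implementation): anchors are noun phrases found in the text, and
# destinations are looked up by matching anchor text against page titles.

import re

def detect_anchors(text, noun_phrases):
    """Anchor detection: return candidate phrases that occur in the text.

    `noun_phrases` stands in for the output of a grammar-based chunker;
    here it is simply a given list of phrases.
    """
    return [np for np in noun_phrases if np.lower() in text.lower()]

def find_destination(anchor, page_titles):
    """Destination finding: match the anchor against wiki page titles."""
    # Exact title match first, then a very crude normalised match.
    for title in page_titles:
        if anchor.lower() == title.lower():
            return title
    for title in page_titles:
        if re.sub(r"[^a-z0-9]", "", anchor.lower()) == re.sub(r"[^a-z0-9]", "", title.lower()):
            return title
    return None

def generate_links(text, noun_phrases, page_titles):
    """Combine both stages: every anchor with a destination becomes a link."""
    links = {}
    for anchor in detect_anchors(text, noun_phrases):
        destination = find_destination(anchor, page_titles)
        if destination is not None:
            links[anchor] = destination
    return links

# Example: links "information retrieval" to the page of the same name.
print(generate_links(
    "Information retrieval is studied at the university.",
    noun_phrases=["information retrieval", "university"],
    page_titles=["Information retrieval", "University of Twente"]))
```

Keeping the two functions separate is what allows different anchor detectors and destination finders to be swapped in and compared within one framework.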

Matthijs Ooms graduates on Provenance Management for Bioinformatics

by Matthijs Ooms

Scientific Workflow Management Systems (SWfMSs), such as our own research prototype e-BioFlow, are being used by bioinformaticians to design and run data-intensive experiments, connecting local and remote (Web) services and tools. Preserving data for later inspection or reuse, determining the quality of results, and validating those results are essential for scientific experiments; all of this can be achieved by collecting provenance data. The dependencies between services and data are captured in a provenance model, such as the interchangeable Open Provenance Model (OPM). This research consists of the following two provenance-related goals:

  1. Using a provenance archive effectively and efficiently as cache for workflow tasks.
  2. Designing techniques to support browsing and navigation through a provenance archive.

The use case identified is called OligoRAP and is taken from the life science domain. OligoRAP is cast as a workflow in the SWfMS e-BioFlow. Its performance in terms of duration was measured and its results were validated by comparing them to the results of the original Perl implementation. By casting OligoRAP as a workflow and using parallelism, its performance is improved by a factor of two.

Many improvements were made to e-BioFlow in order to run OligoRAP, among which a new provenance implementation based on the OPM, enabling provenance capturing during the execution of OligoRAP in e-BioFlow. During this research, e-BioFlow has grown from a proof of concept into a powerful research prototype. For the OPM implementation, a profile has been proposed that defines how provenance is collected during workflow enactment. The proposed profile maintains the hierarchical structure of (sub)workflows in the collected provenance data and thereby improves the interoperability of the OPM for SWfMSs. A caching strategy for workflow tasks is proposed and implemented in e-BioFlow; it queries the OPM-based provenance archive for previous task executions. The queries are optimised by reformulating them and by creating several indices, and the performance improvement of each optimisation was measured using a query set taken from an OligoRAP cache run. Three tasks in OligoRAP were cached, resulting in an additional performance improvement of 19%. This shows that a provenance archive based on the OPM can be used to effectively cache workflow tasks. Finally, a provenance browser is introduced that incorporates several techniques to help browsing through large provenance archives; its primary visualisation is the graph representation specified by the OPM.
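A minimal sketch of the caching idea, assuming a simple in-memory archive keyed on task name and inputs; the class and function names are hypothetical and this is not the e-BioFlow or OPM implementation.

```python
# Minimal sketch of caching workflow tasks with a provenance archive
# (hypothetical names; not the e-BioFlow or OPM implementation).
# A task execution is stored as (task name, input values) -> output,
# and a new execution first looks for an identical previous execution.

import hashlib
import json

class ProvenanceArchive:
    def __init__(self):
        self._records = {}          # acts as the index over past executions

    def _key(self, task_name, inputs):
        blob = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha1(blob.encode()).hexdigest()

    def record(self, task_name, inputs, output):
        self._records[self._key(task_name, inputs)] = output

    def lookup(self, task_name, inputs):
        return self._records.get(self._key(task_name, inputs))

def run_task(archive, task_name, inputs, task_fn):
    """Run a workflow task, reusing a cached result when the archive shows
    the exact same task was already executed on the same inputs."""
    cached = archive.lookup(task_name, inputs)
    if cached is not None:
        return cached                     # cache hit: skip re-execution
    output = task_fn(**inputs)            # cache miss: execute the task
    archive.record(task_name, inputs, output)
    return output

archive = ProvenanceArchive()
blast = lambda sequence: f"alignment-of-{sequence}"   # stand-in for a real tool
print(run_task(archive, "blast", {"sequence": "ACGT"}, blast))  # executes the task
print(run_task(archive, "blast", {"sequence": "ACGT"}, blast))  # served from the cache
```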

More information at the e-BioFlow project page at SourceForge, or in Matthijs' master thesis in ePrints.

Concept Detectors: How Good is Good Enough?

A Monte Carlo Simulation Based Approach

by Robin Aly and Djoerd Hiemstra

Today, semantic concept based video retrieval systems often show insufficient performance for real-life applications. Clearly, a large share of the reason is the lacking performance of the detectors of these concepts. While concept detectors continue to improve, the following important questions need to be addressed: “How good do detectors need to be to produce usable search systems?” and “How does the detector performance influence different concept combination methods?”. We use Monte Carlo simulations to provide answers to these questions. The main contribution of this paper is a probabilistic model of detectors which output confidence scores to indicate the likelihood that a concept occurs. This score is also converted into a posterior probability and a binary classification. We investigate the influence of changes to the model’s parameters on the performance of multiple concept combination methods. Current web search engines produce a mean average precision (MAP) of around 0.20. Our simulation reveals that the best performing video search methods achieve this performance using detectors with a MAP of 0.60, and are therefore usable in real life. Furthermore, perfect detection allows the best performing combination method to produce a search MAP of 0.39 in an artificial environment with Oracle settings. We also find that MAP is not necessarily a good evaluation measure for concept detectors, since it is not always correlated with search performance.
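The following sketch illustrates what a Monte Carlo simulation of a single concept detector could look like. The Gaussian score model and all parameters are assumptions made for illustration, not the exact model from the paper.

```python
# Sketch of a Monte Carlo simulation of a concept detector (illustrative only;
# the Gaussian score model and its parameters are assumptions, not the exact
# model from the paper). Shots containing the concept get higher confidence
# scores on average; we then measure average precision of the score ranking.

import random

def simulate_average_precision(n_shots=1000, prior=0.1, mu_pos=1.0, mu_neg=0.0,
                               sigma=1.0, seed=42):
    rng = random.Random(seed)
    shots = []
    for _ in range(n_shots):
        occurs = rng.random() < prior                 # does the concept occur?
        mean = mu_pos if occurs else mu_neg
        score = rng.gauss(mean, sigma)                # simulated detector output
        shots.append((score, occurs))
    shots.sort(key=lambda s: s[0], reverse=True)      # rank shots by confidence

    # Average precision of the ranking induced by the simulated detector.
    hits, precision_sum = 0, 0.0
    for rank, (_, occurs) in enumerate(shots, start=1):
        if occurs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(hits, 1)

# A larger separation between the two score distributions simulates a better
# detector and should yield a higher average precision.
print(simulate_average_precision(mu_pos=0.5))
print(simulate_average_precision(mu_pos=2.0))
```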

The paper will be presented at ACM Multimedia in Beijing, China.

[download pdf]

Robin Aly presents at SIGIR Doctoral Consortium

Modeling Uncertainty in Video Retrieval: A Retrieval Model for Uncertain Semantic Representations of Videos

by Robin Aly

The need for content-based multimedia retrieval is increasing rapidly because of ever faster growing collection sizes. However, retrieval systems often do not perform well enough for real-life applications. A promising approach is to detect semantic primitives at indexing time. Currently investigated primitives are the uttering of words and the occurrence of so-called semantic concepts, such as “Outdoor” and “Person”. We refer to a concrete instantiation of these primitives as the representation of the video document. Most detector programs emit scores reflecting the likelihood of each primitive. However, the detection is far from perfect and a lot of uncertainty about the real representation remains. Some retrieval algorithms ignore this uncertainty, which clearly hurts precision and recall. Other methods use the scores as anonymous features and learn their relationship to relevance. This has the disadvantage of requiring vast amounts of training data, and the learning has to be redone for every detector change.

The main contribution of our work is a formal retrieval model for treating this uncertainty. We conceptually consider the retrieval problem as two steps: (1) the determination of the posterior probability distribution over all representations given the scores (using existing methods), and (2) the derivation of a ranking status value (RSV) for each representation. We then take the expected RSV, weighted by the representation’s posterior probability, as the effective RSV of this shot for ranking. We claim that our approach has the following advantages: (a) step (2) is easier to achieve than the machine learning alternative, and (b) it benefits from all detector improvements.
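In a formula (the notation below is chosen for illustration and is not necessarily the thesis' own): with o the observed detector scores and r ranging over the possible representations of a shot d, the effective score used for ranking is the expectation

```latex
% Expected ranking status value; o = observed detector scores,
% r = a possible representation of shot d (notation assumed for illustration).
\mathrm{RSV}(d \mid o) \;=\; \mathbb{E}\big[\,\mathrm{RSV}(d, R) \mid o\,\big]
                       \;=\; \sum_{r} P(r \mid o)\,\mathrm{RSV}(d, r)
```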

[more information]

Parallel and Distributed Databases at Euro-Par 2009

by Djoerd Hiemstra, Alfons Kemper, Manuel Prieto, and Alex Szalay

Euro-Par is an annual series of international conferences dedicated to the promotion and advancement of parallel and distributed computing, covering all aspects of its hardware, software, algorithms and applications. The Euro-Par 2009 conference will take place in Delft, the Netherlands, from August 25th until August 28th, 2009.

Sessions at Euro-Par cover several topics. Euro-Par Topic 5 addresses data management issues in parallel and distributed computing. Advances in data management (storage, access, querying, retrieval, mining) are inherent to current and future information systems. Today, accessing large volumes of information is a reality: data-intensive applications enable huge user communities to transparently access multiple pre-existing autonomous, distributed and heterogeneous resources (data, documents, images, services, etc.). Data management solutions need efficient techniques for exploiting and mining large datasets available in clusters, peer-to-peer and Grid architectures. Parallel and distributed file systems, databases, data warehouses, and digital libraries are a key element for achieving scalable, efficient systems that will cost-effectively manage and extract data from huge amounts of highly distributed and heterogeneous digital data repositories.

Each paper submitted to Euro-Par’s topic Parallel and Distributed Databases was reviewed by at least three reviewers. Of the 11 papers submitted to the topic this year, 3 were accepted, giving an acceptance rate of 27%. The three accepted papers discuss diverse issues: database transactions, efficient and reliable structured peer-to-peer systems, and selective replicated declustering.

In their paper Unifying Memory and Database Transactions, Ricardo Dias and João Lourenço present a simple but powerful idea: combining software transactional memory with database transactions. The paper proposes to provide unified memory and database transactions by integrating the database transaction control with a software framework for transactional memory. Experimental results show that the overhead for unified transactions is low, and the approach is likely to lower the burden on the application developer.

The paper by Hao Gong, Guangyu Shi, Jian Chen, and Lingyuan Fan, A DHT Key-Value Storage System with Carrier Grade Performance, tries to achieve reliability and efficiency in peer-to-peer systems in order to support telecom services. The proposed design adopts a two-layer distributed hash table that embeds location information into peer IDs, and provides one-hop routing by enhancing each peer with an additional one-hop routing table, where super-peers are in charge of updating and synchronizing this routing information. Finally, the approach replicates subscriber data on multiple peers.

In the third paper, Selective Replicated Declustering for Arbitrary Queries, Kerim Oktay, Ata Turk, and Cevdet Aykanat present a new algorithm for selective replicated declustering for arbitrary queries. The algorithm makes use of available query information to decide on the assignment of data to different disks and on which data to replicate, respecting space constraints. The authors also describe how to apply the proposed algorithm recursively to obtain a multi-way replicated declustering. Experiments show that the algorithm outperforms existing replicated declustering schemes, especially for low replication constraints.

[download pdf]

More info at the Euro-Par 2009 conference site.

Towards an Information Retrieval Theory of Everything

I present three well-known probabilistic models of information retrieval in tutorial style: the binary independence probabilistic model, the language modeling approach, and Google’s PageRank. Although all three models are based on probability theory, they are very different in nature. Each model seems well-suited for solving certain information retrieval problems, but not so useful for solving others. So, essentially each model solves part of a bigger puzzle, and a unified view on these models might be a first step towards an Information Retrieval Theory of Everything.
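As a small illustration of one of the three models, here is a Python sketch of the language modeling approach with linear interpolation smoothing. The documents, query and smoothing weight are made up, and the snippet is not taken from the tutorial itself.

```python
# Minimal sketch of the language modeling approach to retrieval (one of the
# three models discussed): rank documents by the probability that their
# language model generates the query, smoothed with the collection model.
# Illustrative only; documents and smoothing weight are made up.

import math
from collections import Counter

def score(query, doc, collection, lam=0.5):
    """log P(query | doc) with linear interpolation (Jelinek-Mercer) smoothing."""
    doc_tf, col_tf = Counter(doc), Counter(collection)
    doc_len, col_len = len(doc), len(collection)
    log_p = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len          # document model
        p_col = col_tf[term] / col_len          # collection (background) model
        log_p += math.log(lam * p_doc + (1 - lam) * p_col)
    return log_p

docs = {
    "d1": "information retrieval models probabilistic".split(),
    "d2": "page rank models the web graph".split(),
}
collection = [t for d in docs.values() for t in d]
query = "probabilistic retrieval".split()

# Rank the documents by query likelihood; d1 should come out on top.
print(sorted(docs, key=lambda d: score(query, docs[d], collection), reverse=True))
```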

The paper is published in the newsletter of the NVTI, the “Nederlandse Vereniging voor Theoretische Informatica” (Dutch Association for Theoretical Computer Science). A more extensive overview of information retrieval theory, covering eight models, is given in: Djoerd Hiemstra, Information Retrieval Models. In: Ayse Goker and John Davies (eds.), Information Retrieval: Searching in the 21st Century, Wiley, 2009.

[download pdf]

Reusing Annotation Labor for Concept Selection

by Robin Aly, Djoerd Hiemstra and Arjen de Vries

Describing shots through the occurrence of semantic concepts is the first step towards modeling the content of a video semantically. An important challenge is to automatically select the right concepts for a given information need. For example, systems should be able to decide whether the concept “Outdoor” should be included into a search for “Street Basketball”. In this paper we provide an innovative method to automatically select concepts for an information need. To achieve this, we estimate the occurrence probability of a concept in relevant shots, which helps us quantify the helpfulness of the concept. Our method reuses existing training data that is annotated with concept occurrences to build a text collection. Searching in this collection with a text retrieval system, combined with the known concept occurrences, allows us to come up with a good estimate of this probability. We evaluate our method against a concept selection benchmark and search runs on both the TRECVID 2005 and 2007 collections. These experiments show that the estimation consistently improves retrieval effectiveness.
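A simplified sketch of the idea, with made-up annotations and a deliberately crude term-overlap retrieval step standing in for a full text retrieval system:

```python
# Sketch of the concept selection idea (simplified; the retrieval step and the
# data are made up): training shots come with a textual description and with
# manually annotated concepts. We search the descriptions with the text query
# and estimate how often each concept occurs in the top-ranked shots.

from collections import Counter

# Hypothetical annotated training data: (description, set of concepts).
training_shots = [
    ("kids playing basketball on an outdoor court", {"Outdoor", "Sports", "Person"}),
    ("street scene with cars and pedestrians",      {"Outdoor", "Urban", "Person"}),
    ("news anchor in a television studio",          {"Indoor", "Person"}),
    ("basketball game in an indoor arena",          {"Indoor", "Sports", "Person"}),
]

def estimate_concept_probabilities(query, shots, top_k=2):
    """Estimate P(concept occurs | shot relevant) from the top-ranked shots."""
    terms = set(query.lower().split())
    # Crude text retrieval: rank shots by term overlap with the query.
    ranked = sorted(shots, key=lambda s: len(terms & set(s[0].split())), reverse=True)
    top = ranked[:top_k]
    counts = Counter(c for _, concepts in top for c in concepts)
    return {c: counts[c] / len(top) for c in counts}

# Concepts with a high estimated probability would be selected for the search.
print(estimate_concept_probabilities("street basketball", training_shots))
```

For the query "street basketball", the concept "Outdoor" receives a high estimated probability and would be selected, matching the example in the abstract.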

[download pdf]

Sander Bockting graduates on collection selection for distributed web search

Using Highly Discriminative Keys, Query-driven Indexing and ColRank

Current popular web search engines, such as Google, Live Search and Yahoo!, rely on crawling to build an index of the World Wide Web. Crawling is a continuous process to keep the index fresh and generates an enormous amount of data traffic. By far the largest part of the web remains unindexed, because crawlers are unaware of the existence of web pages and they have difficulties crawling dynamically generated content.

These problems were the main motivation to research distributed web search. We assume that web sites, or peers, can index a collection consisting of local content, but possibly also content from other web sites. Peers cooperate with a broker by sending it a part of their index. By receiving indices from many peers, the broker gains a global overview of the peers’ content. When a user poses a query to the broker, the broker selects a few peers to which it forwards the query. The selected peers should be likely to produce a good result set with many relevant documents. The result sets are merged at the broker and sent to the user.

This research focuses on collection selection, which corresponds to selecting the most promising peers. Highly discriminative keys are employed as a strategy to select those peers. A highly discriminative key is a term set that is an index entry at the broker; the key is highly discriminative with respect to the collections because the posting lists pointing to the collections are relatively small. Query-driven indexing is applied to reduce the index size by only storing index entries that are part of popular queries. A PageRank-like algorithm, ColRank, is also tested to assign scores to collections that can be used for ranking.

The Sophos prototype was developed to test these methods. Sophos was evaluated on different aspects, such as collection selection performance and index sizes. The performance of the methods was compared to a baseline that applied language modeling to merged documents in collections. The results show that Sophos can outperform the baseline with ad-hoc queries on a web-based test set. Query-driven indexing is able to substantially reduce index sizes at the cost of a small loss in collection selection performance. We also found large differences in how difficult it is to answer queries on various corpus splits.
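A simplified sketch of broker-side collection selection with highly discriminative keys; the peer collections and the posting-list threshold are made up, and query-driven indexing and the ColRank scores are left out.

```python
# Simplified sketch of broker-side collection selection with highly
# discriminative keys (made-up data and threshold; query-driven indexing and
# ColRank are left out). A term set is kept as a key only if its posting
# list -- the collections containing it -- is small enough.

from itertools import combinations

def build_hdk_index(collections, max_postings=1, max_terms=2):
    """Index term sets whose posting lists hold at most `max_postings` collections."""
    index = {}
    vocabulary = {t for terms in collections.values() for t in terms}
    for size in range(1, max_terms + 1):
        for key in combinations(sorted(vocabulary), size):
            postings = [c for c, terms in collections.items() if set(key) <= terms]
            if 0 < len(postings) <= max_postings:
                index[key] = postings
    return index

def select_collections(index, query, max_terms=2):
    """Select the peers whose highly discriminative keys match the query."""
    terms = sorted(set(query.lower().split()))
    selected = set()
    for size in range(1, max_terms + 1):
        for key in combinations(terms, size):
            selected.update(index.get(key, []))
    return selected

# Hypothetical peer collections, represented by their term sets.
collections = {
    "peer1": {"distributed", "search", "engine"},
    "peer2": {"distributed", "databases", "transactions"},
    "peer3": {"video", "retrieval", "search"},
}
index = build_hdk_index(collections)
# "distributed" and "search" alone are too common to be keys here, but the
# pair ("distributed", "search") is highly discriminative and selects peer1.
print(select_collections(index, "distributed search"))
```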

More information in [E-Prints]

WikiTranslate: translations using only Wikipedia

WikiTranslate: Query Translation for Cross-lingual Information Retrieval using only Wikipedia

by Dong Nguyen, Arnold Overwijk, Claudia Hauff, Dolf Trieschnigg, Djoerd Hiemstra and Franciska de Jong

This paper presents WikiTranslate, a system which performs query translation for cross-lingual information retrieval (CLIR) using only Wikipedia to obtain translations. Queries are mapped to Wikipedia concepts and the corresponding translations of these concepts in the target language are used to create the final query. WikiTranslate is evaluated by searching with topics formulated in Dutch, French and Spanish in an English data collection. The system achieved a performance of 67% compared to the monolingual baseline.
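A simplified sketch of the idea, not the actual WikiTranslate system: query terms are mapped to Wikipedia concepts, and the cross-language links of those concepts provide the translations. The concept mapping and the language-link table below are tiny made-up stand-ins for Wikipedia's real title index and interlanguage links.

```python
# Simplified sketch of the WikiTranslate idea (not the actual system): map
# query terms to Wikipedia concepts, then use the cross-language links of
# those concepts to build the query in the target language.

# English article title -> title of the linked article per target language.
cross_language_links = {
    "Windmill":    {"nl": "Windmolen", "fr": "Moulin à vent"},
    "Netherlands": {"nl": "Nederland", "fr": "Pays-Bas"},
}

def map_to_concepts(query):
    """A term maps to a concept if it matches an article title
    (singular/plural handled very crudely by prefix matching)."""
    terms = query.lower().split()
    return [title for title in cross_language_links
            if any(t.startswith(title.lower()) for t in terms)]

def translate_query(query, target_language):
    concepts = map_to_concepts(query)
    translations = [cross_language_links[c][target_language]
                    for c in concepts if target_language in cross_language_links[c]]
    return " ".join(translations)

print(translate_query("windmills in the netherlands", "nl"))   # Windmolen Nederland
```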

[download pdf]

The technology behind StreetTiVo

StreetTiVo: Using a P2P XML Database System to Manage Multimedia Data in Your Living Room

by Ying Zhang, Arjen de Vries, Peter Boncz, Djoerd Hiemstra, and Roeland Ordelman

StreetTiVo is a project that aims at bringing research results into the living room; in particular, a mix of current results in the areas of Peer-to-Peer XML Database Management System (P2P XDBMS), advanced multimedia analysis techniques, and advanced information retrieval techniques. The project develops a plug-in application for the so-called Home Theatre PCs, such as set-top boxes with MythTV or Windows Media Center Edition installed, that can be considered as programmable digital video recorders. StreetTiVo distributes compute-intensive multimedia analysis tasks over multiple peers (i.e., StreetTiVo users) that have recorded the same TV program, such that a user can search in the content of a recorded TV program shortly after its broadcasting; i.e., it enables near real-time availability of the meta-data (e.g., speech recognition) required for searching the recorded content. StreetTiVo relies on our P2P XDBMS technology, which in turn is based on a DHT overlay network, for distributed collaborator discovery, work coordination and meta-data exchange in a volatile WAN environment. The technologies of video analysis and information retrieval are seamlessly integrated into the system as XQuery functions.
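A toy sketch of the coordination idea, not StreetTiVo's actual protocol, P2P XDBMS, or XQuery interface: peers that recorded the same broadcast derive the same DHT key, register under it, and divide the analysis work among themselves.

```python
# Toy sketch of the coordination idea behind StreetTiVo (not its actual
# protocol): peers that recorded the same broadcast register under the same
# key in a (here simulated) DHT, split the recording into fragments, and
# each peer analyses only its own share of the fragments.

import hashlib

dht = {}   # stands in for the distributed hash table overlay

def program_key(channel, start_time):
    """All peers that recorded the same broadcast derive the same key."""
    return hashlib.sha1(f"{channel}|{start_time}".encode()).hexdigest()

def register(peer_id, channel, start_time):
    dht.setdefault(program_key(channel, start_time), []).append(peer_id)

def my_fragments(peer_id, channel, start_time, n_fragments=6):
    """Divide the fragments of the recording over the registered peers."""
    peers = sorted(dht[program_key(channel, start_time)])
    my_index = peers.index(peer_id)
    return [f for f in range(n_fragments) if f % len(peers) == my_index]

for peer in ("tv-box-a", "tv-box-b", "tv-box-c"):
    register(peer, channel="NED1", start_time="2009-04-01T20:00")

# Each peer analyses a different third of the recording; the resulting
# metadata (e.g. speech transcripts) would then be exchanged via the DHT.
for peer in ("tv-box-a", "tv-box-b", "tv-box-c"):
    print(peer, my_fragments(peer, "NED1", "2009-04-01T20:00"))
```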

The paper will be presented at the Joint International Conferences on Asia-Pacific Web Conference (APWeb) and Web-Age Information Management (WAIM), 1-4 April 2009, in Suzhou, China.

[download pdf]