SARA organizes Hadoop hackathon

On December 7, SARA (the Dutch National High Performance Computing and e-Science Support Center) organizes a day-long hackathon to kick off a Proof-of-Concept Hadoop service and to give participants the opportunity to experiment with Hadoop with the support of experienced users. Those interested can work with Hadoop on a case of their choice, or simply play with datasets such as Wikipedia, the ENRON dataset, White House visitor records, genome data, and others.

See: SARA starts Apache Hadoop Proof-of-Concept.

What’s cooking in the OLC-IT?

The University Curriculum Committee IT (opleidingcommissie IT or OLC-IT) discusses the bachelor programmes “Informatica” and “Telematica” and the master programmes Computer Science, Human Media Interaction, and Telematics. The committee's 50th meeting was celebrated by cooking Italian dishes at Kook & Co.

OLC-IT

Standing from left to right: Paul Havinga, Marieke Huisman, Georgios Karagiannis, Betsy van Dijk, Johan Noltes, Sabine Padberg, Ralph Broenink, Beer Sijpestijn, Bas Stottelaar, Ruud Verbij, Jan Schut, Hans Romkema. Sitting from left to right: Rom Langerak, Djoerd Hiemstra, Gerrit van der Hoeven.

University of Twente at TREC 2010

MapReduce for Experimental Search

by Djoerd Hiemstra and Claudia Hauff

This draft report presents preliminary results for the TREC 2010 ad-hoc web search task. We ran our MIREX system on 0.5 billion web documents from the ClueWeb09 crawl. On average, the system retrieves at least 3 relevant documents on the first result page containing 10 results, using a simple index of anchor texts and page titles, together with spam removal.
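
Three relevant documents among the first ten results corresponds to a precision at 10 of 0.3. The sketch below shows how that figure is computed for a single query. The relevance judgments are made up for illustration and are not TREC data, and the function name is ours, not part of MIREX.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that are judged relevant.

    `relevance` is a ranked list of booleans (True = relevant),
    with the best-ranked result first.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    # Divide by k, not by len(top_k): missing results count as non-relevant.
    return sum(1 for judged in top_k if judged) / k


# Hypothetical judgments for one query's first result page (not TREC data).
first_page = [True, False, False, True, False,
              False, True, False, False, False]

print(precision_at_k(first_page))
```

Averaging this value over all queries of the task gives the kind of per-page figure quoted above.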

[download pdf]

Guest lecture by Arjen de Vries

How search logs can help improve future searches

In the European project Vitalas, we had the opportunity to analyze the search log data from a commercial picture portal of a European news agency, which offers access to photographic images to professional users. I will discuss how these logs can be used in various ways to improve image search: to expand the image representation, to make suggestions of alternative queries, to adapt the search results to user context, and to automatically build concept detectors for content-based image retrieval. I also present recent work on using the semantic information that has become publicly available in the form of linked data to improve the search log analysis. The results show that bringing in linked data gives insights beyond the more common term-based analysis, since queries related in the most frequent ways do not usually share terms. I conclude with a discussion of the implications of our findings for improving log analysis, image collection management, and search engine design.

The guest lecture takes place on 20 October 2010 at 13.45 h. in ZI-2126.

Welcome to the MapReduce course

Welcome to Distributed Data Processing using MapReduce

This course stays on top of some very exciting developments in cloud computing and data centers, initiated by Google and followed by many others such as Yahoo, Amazon, AOL, Baidu, Joost, Mylife, and Facebook. The course is about processing terabytes of data on large clusters. But not only that: few courses in the Computer Science master’s programme are as “core computer science” as this one. We will discuss new file systems (GFS and Hadoop FS), new programming paradigms (MapReduce), new programming languages and query languages (Sawzall, Pig Latin), new database paradigms (BigTable, Cassandra and Dynamo), and of course many web search and data mining applications that made Google one of today’s leading IT companies.
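
To give a flavour of the MapReduce paradigm, here is a minimal single-machine sketch of the classic word count in plain Python. It only imitates the map, shuffle, and reduce steps in memory; it does not use Hadoop, and the function names are our own illustration rather than any framework's API.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a dict of documents."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


# Toy input; a real job would read terabytes from a distributed file system.
docs = {"d1": "to be or not to be", "d2": "to search"}
counts = word_count(docs)
print(counts["to"], counts["be"], counts["search"])
```

On a cluster, the map calls run in parallel on different parts of the data and the reduce calls run in parallel on different keys; the programmer only writes the two small functions.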

We hope to see you at our lectures on Fridays in the 3rd and 4th hour.
Robin Aly, Maarten Fokkinga, and Djoerd Hiemstra.

DIR 2011 in Amsterdam

DIR 2011 will be held on 4 February 2011 at the University of Amsterdam. You are invited to submit contributions to the Dutch-Belgian Information Retrieval workshop (DIR 2011). The primary aim of DIR is to provide an international meeting place where researchers from the domain of information retrieval and related disciplines can exchange information and present innovative research developments. DIR 2011 will put special emphasis on interaction – by focusing on poster presentations and creating space and time to meet and discuss new ideas.

More information on the DIR 2011 web site.

Guest lecture by Thijs Westerveld

Automatically Analyzing Word of Mouth

Thijs Westerveld from Teezir B.V., Utrecht, will give a guest lecture on 6 October 2010 in ZI-2126. Teezir uses advanced search technology to aggregate views and opinions found on review sites, in discussion groups or blogs. This way, we create statistics and interpretations about what people are saying. Querying this data allows decision makers to slice and dice the content, and learn what people say, either at the very aggregated level: “what is the share of positive versus negative views about our new product?”, or at the very detailed level: “which sources reflect this negative sentiment, and what exactly are people saying?”

In this talk I will demonstrate Teezir’s Opinion Analysis dashboards and discuss the underlying technology. For collecting content from web sites we developed advanced crawling technology that automatically identifies relevant news, blog and forum pages and extracts the relevant content and metadata. The collected content is then further analyzed to identify the main sentiments before everything is indexed to be disclosed in the online dashboards. Various sentiment analysis variants that have proven successful in an academic setting have been evaluated on our live collections. I will demonstrate that success on academic test collections does not necessarily imply the practical use of a sentiment analysis algorithm.

See also: Who rules?

Dolf Trieschnigg defends PhD thesis on Biomedical IR

Proof of Concept: Concept-based Biomedical Information Retrieval

by Dolf Trieschnigg

In this thesis we investigate the possibility of integrating domain-specific knowledge into biomedical information retrieval (IR). Recent decades have shown a fast-growing interest in biomedical research, reflected by an exponential growth in scientific literature. Biomedical IR is concerned with the disclosure of these vast amounts of written knowledge. Biomedical IR is not only important for end-users, such as biologists, biochemists, and bioinformaticians searching directly for relevant literature, but also plays an important role in more sophisticated knowledge discovery. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Multiple synonymous terms can be used for single biomedical concepts, such as genes and diseases. Conversely, single terms can be ambiguous, and may refer to multiple concepts. Dealing with the terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge in modern word-based information retrieval is, however, far from trivial. This thesis investigates the problem of handling biomedical terminology based on three research themes.

The first research theme deals with robust word-based retrieval. Effective retrieval models commonly use a word-based representation for retrieval. As so many spelling variations are present in biomedical text, the way in which these word-based representations are obtained affects retrieval effectiveness. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. This investigation included stop-word removal, stemming, different approaches to breakpoint identification and normalisation, and character n-gramming. In particular, breakpoint identification and normalisation (that is, determining word parts in biomedical compounds) showed a strong effect on retrieval performance. A combination of effective preprocessing heuristics was identified and used to obtain word-based representations from text for the remainder of this thesis.
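
As a rough illustration of two of these heuristics, the sketch below splits a token at punctuation and at letter-digit boundaries, and produces character n-grams. The splitting rules are our own simplifying assumptions for the example; they are not the heuristics evaluated in the thesis.

```python
import re


def split_breakpoints(token):
    """Split a token into lowercase runs of letters and runs of digits.

    Assumed rule (illustrative only): any other character, such as a
    hyphen or slash, acts as a breakpoint and is dropped. Non-ASCII
    letters are dropped too, which a real system would need to handle.
    """
    return re.findall(r"[a-z]+|[0-9]+", token.lower())


def char_ngrams(token, n=3):
    """Return the overlapping character n-grams of a token."""
    token = token.lower()
    if len(token) < n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


# Spelling variants of the same compound end up with the same word parts.
print(split_breakpoints("IL-2"), split_breakpoints("IL2"))
print(char_ngrams("gene"))
```

With rules like these, a query containing one spelling variant can still match documents that use another.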

The second research theme deals with concept-based retrieval. We investigated two representation vocabularies for concept-based indexing, one based on the Medical Subject Headings thesaurus, the other based on the Unified Medical Language System metathesaurus extended with a number of gene and protein dictionaries.

We investigated the following five topics.

  1. How documents are represented in a concept-based representation.
  2. To what extent such a document representation can be obtained automatically.
  3. To what extent a text-based query can be automatically mapped onto a concept-based representation and how this affects retrieval performance.
  4. To what extent a concept-based representation is effective in representing information needs.
  5. How the relationship between text and concepts can be used to determine the relatedness of concepts.

We compared different classification systems to obtain concept-based document and query representations automatically. We proposed two classification methods based on statistical language models, one based on K-Nearest Neighbours (KNN) and one based on Concept Language Models (CLM).

For a selection of classification systems we carried out a document classification experiment in which we investigated to what extent automatic classification could reproduce manual classification. The proposed KNN system performed well in comparison to the out-of-the-box systems. Manual analysis indicated the improved exhaustiveness of automatic classification over manual classification. Retrieval based on only concepts was demonstrated to be significantly less effective than word-based retrieval. This deteriorated performance could be explained by errors in the classification process, limitations of the concept vocabularies and limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations did significantly improve word-only retrieval. In an artificial setting, we compared the optimal retrieval performance which could be obtained with word-based and concept-based representations. Contrary to our intuition, on average a single word-based query performed better than a single concept-based representation, even when the best concept term precisely represented part of the information need.

We investigated to what extent the relatedness between pairs of concepts as indicated by human judgements could be automatically reproduced. Results on a small test set indicated that a method based on comparing concept language models performed particularly well in comparison to systems based on taxonomy structure, information content and (document) association.

In the third and last research theme of this thesis we propose a framework for concept-based retrieval. We approached the integration of domain knowledge in monolingual information retrieval as a cross-lingual information retrieval (CLIR) problem. Two languages were identified in this monolingual setting: a word-based representation language based on free text, and a concept-based representation language based on a terminological resource. Similar to what is common in traditional CLIR, queries and documents are translated into the same representation language and matched. The cross-lingual perspective gives us the opportunity to adopt a large set of established CLIR methods and techniques for this domain. In analogy to established CLIR practice, we investigated translation models based on a parallel corpus containing documents in multiple representations and translation models based on a thesaurus. Surprisingly, even the integration of very basic translation models showed improvements in retrieval effectiveness over word-only retrieval. A translation model based on pseudo-feedback translation was shown to perform particularly well. We proposed three extensions to a basic cross-lingual retrieval model which, similar to previous approaches in established CLIR, improved retrieval effectiveness by combining multiple translation models. Experimental results indicate that, even when using very basic translation models, monolingual biomedical IR can benefit from a cross-lingual approach to integrate domain knowledge.
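
The idea of translating a word-based query into the concept language can be sketched as follows. Everything in this example is hypothetical: the translation table, its probabilities, and the concept labels are invented for illustration, and the simple scoring function stands in for the retrieval models actually studied in the thesis.

```python
from collections import defaultdict

# Hypothetical translation probabilities P(concept | word).
# The concept labels are illustrative, not taken from a real thesaurus.
TRANSLATION = {
    "tumor": {"C:Neoplasms": 0.9, "C:Tumor Markers": 0.1},
    "cancer": {"C:Neoplasms": 1.0},
    "breast": {"C:Breast": 0.7, "C:Breast Neoplasms": 0.3},
}


def translate_query(words, table=TRANSLATION):
    """Translate query words into a normalised weighted concept query."""
    weights = defaultdict(float)
    for word in words:
        # Words without a translation contribute nothing.
        for concept, prob in table.get(word, {}).items():
            weights[concept] += prob
    total = sum(weights.values())
    if total == 0:
        return {}
    return {concept: weight / total for concept, weight in weights.items()}


def score(doc_concepts, query_weights):
    """Score a document by the total weight of query concepts it contains."""
    return sum(weight for concept, weight in query_weights.items()
               if concept in doc_concepts)


concept_query = translate_query(["breast", "cancer"])
document = {"C:Neoplasms", "C:Breast"}  # concepts assigned to one document
print(concept_query)
print(score(document, concept_query))
```

In this toy setting, matching happens entirely in the concept language, so a document indexed with the right concepts can be found even if it never uses the query's words.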

[download pdf]

A Cross-lingual Framework for Monolingual Biomedical Information Retrieval

by Dolf Trieschnigg, Djoerd Hiemstra, Franciska de Jong, and Wessel Kraaij

An important challenge for biomedical information retrieval (IR) is dealing with the complex, inconsistent and ambiguous biomedical terminology. Frequently, a concept-based representation defined in terms of a domain-specific terminological resource is employed to deal with this challenge. In this paper, we approach the incorporation of a concept-based representation in monolingual biomedical IR from a cross-lingual perspective. In the proposed framework, this is realized by translating and matching between text and concept-based representations. The approach allows for deployment of a rich set of techniques proposed and evaluated in traditional cross-lingual IR. We compare six translation models and measure their effectiveness in the biomedical domain. We demonstrate that the approach can result in significant improvements in retrieval effectiveness over word-based retrieval. Moreover, we demonstrate increased effectiveness of a cross-lingual IR framework for monolingual biomedical IR if basic translation models are combined.

The paper will be presented at the 19th ACM International Conference on Information and Knowledge Management on October 26-30 in Toronto, Canada.