Language models, smoothing and n-grams

A language model assigns a probability to a piece of unseen text, based on some training data. For example, a language model based on a big English newspaper archive is expected to assign a higher probability to “a bit of text” than to “aw pit tov tags”, because the words in the former phrase (or word pairs or word triples if so-called n-gram models are used) occur more frequently in the data than the words in the latter phrase. For information retrieval, typical usage is to build a language model for each document. At search time, the top ranked document is the one which’ language model assigns the highest probability to the query.

The term language models originates from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980’s. Speech recognition systems use a language model to complement the results of the acoustic model which models the relation between words (or parts of words called phonemes) and the acoustic signal. The history of language models, however, goes back to beginning of the 20th century when Andrei Markov used language models (Markov models) to model letter sequences in works of Russian literature. Another famous application of language models are Claude Shannon’s models of letter sequences and word sequences, which he used to illustrate the implications of coding and information theory. In the 1990’s language models were applied as a general tool for several natural language processing applications, such as part-of-speech tagging, machine translation, and optical character recognition. Language models were applied to information retrieval by a number of research groups in the late 1990’s. They became rapidly popular in information retrieval research. By 2001, the ACM SIGIR conference had two separate sessions on language models containing 5 papers in total. In 2003, a group of leading information retrieval researchers published a research road map “Challenges in Information Retrieval and Language Modeling”, indicating that the future of information retrieval and the future of language modeling can not be seen apart from each other.

This entry will soon be published in the Encyclopedia of Database Systems by Springer. The Encyclopedia, under the editorial guidance of Ling Liu and M. Tamer Özsu, will be a multiple volume, comprehensive, and authoritative reference on databases, data management, and database systems. Since it will be available in both print and online formats, researchers, students, and practitioners will benefit from advanced search functionality and convenient interlinking possibilities with related online content. The Encyclopedia's online version will be accessible on the SpringerLink platform. Click here for more information about the Encyclopedia of Database Systems.


Welcome to Advanced Database Systems

Welcome to ADS! This course will have a little bit of everything: traditional lectures, home work assignments (on a voluntary basis), a small (half a day) practicum assignment, a small project, and a written exam. The course consist of two parts. In the first part, you are going to learn how relational databases work in practice: what happens inside the system when an SQL query is executed? In the 2nd part of the course, we are going to discuss applications for which relational databases do not provide a perfect solution: object-oriented data, semi-structured data, distributed data and online analytical processing.

More info on TeleTOP

Improved query difficulty prediction for the web

by Claudia Hauff, Vanessa Murdock, Ricardo Baeza-Yates

Query performance prediction aims to predict whether a query will have a high average precision given retrieval from a particular collection, or low average precision. An accurate estimator of the quality of search engine results can allow the search engine to decide to which queries to apply query expansion, for which queries to suggest alternative search terms, to adjust the sponsored results, or to return results from specialized collections. In this paper we present an evaluation of state of the art query prediction algorithms, both post-retrieval and pre-retrieval and we analyze their sensitivity towards the retrieval algorithm. We evaluate query difficulty predictors over three widely different collections and query sets and present an analysis of why prediction algorithms perform significantly worse onWeb data. Finally we introduce Improved Clarity, and demonstrate that it outperforms state-of-the-art predictors on three standard collections, including two large Web collections.

The paper will be presented at the ACM Conference on Information and Knowledge Management CIKM 2008 in Napa Valley, USA

[download pdf]

Dutch-Belgian IR workshop in Twente

Vrijhof with famous sunken tower We organize the 9th Dutch-Belgian Information Retrieval Workshop (DIR) in Twente. The event will take place on February 2-3, 2009 in our lovely “Vrijhof” building. The primary aim of Dutch-Belgian Information Retrieval (DIR) workshop is to provide an international meeting place where researchers from the domain of information retrieval and related disciplines, can exchange information and present innovative research developments. The first call for papers is just out. Papers may range from theoretical work to system descriptions. We encourage Ph.D. students to submit their research. We also welcome contributions from industry when they focus on novel research directions. Have a look at the DIR website at:

Multi-step Relevance Propagation for Expert Finding

by Pavel Serdyukov, Henning Rode, and Djoerd Hiemstra

A fragment of the real expertise graph with links between documents white nodes) and candidate experts (black nodes) for query 'sustainable ecosystems' An expert finding system allows a user to type a simple text query and retrieve names and contact information of individuals that possess the expertise expressed in the query. This paper proposes a novel approach to expert finding in large enterprises or intranets by modeling candidate experts (persons), web documents and various relations among them with so-called expertise graphs. As distinct from the state-of-the-art approaches estimating personal expertise through one-step propagation of relevance probability from documents to the related candidates, our methods are based on the principle of multi-step relevance propagation in topic-specific expertise graphs. We model the process of expert finding by probabilistic random walks of three kinds: finite, infinite and absorbing. Experiments on TREC Enterprise Track data originating from two large organizations show that our methods using multi-step relevance propagation improve over the baseline one-step propagation based method in almost all cases.

The paper will be presented at the ACM Conference on Information and Knowledge Management CIKM 2008 in Napa Valley, USA

[download pdf]

SIGIR Salton Award 2009

SIGIR accepts nominations for the Salton Award to honor members of our community who have made …significant, sustained and continuing contributions to research in information retrieval. The Salton award is awarded triennially at the SIGIR Conference: 2009 is a Salton year. The selection committee consists of the available past Salton Award winners. Nominations can be sent to the SIGIR Chair Elizabeth D. Liddy who coordinates the discussion and nomination.

More information at the SIGIR Award Page

Distributed search: who bids on “britney spears”?

NWO We will start a new NWO research project on the use of keyword auctions for distributed information retrieval. The project's aim is to distribute internet search functionality in such a way that communities of users and/or federations of small search systems provide search services in a collaborative way. Instead of getting all data to a centralized point and process queries centrally, as is done by today's search systems, the project will distribute queries over many small autonomous search systems and process them locally. Distributed information retrieval is a well researched sub area of information retrieval, but it has not resulted in practical solutions for large scale search problems because of high administration costs of setting up large numbers of installations and because it turns out to be hard in practice to direct queries to the appropriate local search systems. In this project we will research a radical new approach to distribute search: distributed information retrieval by means of keyword auctions.

Keyword auctions like Google's AdWords give advertisers the opportunity to provide targeted advertisements by bidding on specific keywords, for instance by bidding on today's hottest query britney spears. Analogous to these keyword auctions, local search systems will bid for keywords at a central broker. They “pay” by serving queries for the broker. The broker will send queries to those local search systems that optimize the overall effectiveness of the system, i.e., local search systems that are willing to serve many queries, but also are able to provide high quality results. The project will approach the problem from three different angles: 1) modeling the local search system, including models for automatic bidding and multi-word keywords; 2) modeling the search broker's optimization using the bids, the quality of the answers, and click-through rates; 3) integration of structured data typically available behind web forms of local search systems with text search. The approaches will be evaluated using prototype systems and simulations on benchmark test collections.

See: NWO news (in Dutch)

Scientific programmer and post-doctoral positions

We have two job positions in the MultimediaN project.

Position 1: Speech Technology
SHoUT is an open source speech recognition toolkit developed at the University of Twente. SHoUT is a Dutch acronym for: “Spraak Herkennings Onderzoek Universiteit Twente”, or in English: “Speech Recognition Research at the University of Twente”. SHoUT is used to aid research on large vocabulary continuous speech recognition, including research into the application of statistical language models, audio segmentation and classification, speaker diarization and machine learning hyper parameter estimation for speech recognition.

Position 2: Search Engine Technology
PF/Tijah (Pathfinder/Tijah, pronounce as “Pee Ef Teeja”) is a flexible open source text search system developed at the University of Twente in cooperation with CWI Amsterdam and TU München. The system is integrated in the Pathfinder XQuery compiler and can be downloaded as part of the MonetDB/XQuery database system. PF/Tijah is used to aid research in information retrieval at the University of Twente, including the application of language models to search, entity retrieva, and implementation of the W3C candidate recommendation XQuery Full-Text.

[Official Job Advertisement] (deadline: August 1, 2008)

Henning Rode defends Ph.D. thesis on Entity Ranking

From Document to Entity Retrieval: Improving Precision and Performance of Focused Text Search

by Henning Rode

Text retrieval is an active area of research since decades. Finding the best index terms, the development of statistical models for the estimation of relevance, using relevance feedback, and the challenge to keep retrieval tasks efficient with ever growing text collections had been important issues over the entire period. Especially in the last decade, we have also seen a diversification of retrieval tasks. Instead of searching for entire documents only, passage or XML retrieval systems allow to formulate a more focused search for finer grained text units. Question answering systems even try to pinpoint the part of a sentence that contains the answer to a user question, and expert search systems return a list of persons with expertise on the topic. The sketched situation forms the starting point of this thesis, which presents a number of task specific search solutions and tries to set them into more generic frameworks. In particular, we take a look at the three areas (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking.

In the first case, we show how different types of context information can be incorporated in the retrieval of documents. When users are searching for information, the search task is typically part of a wider working process. This search context, however, is often not reflected by the few search keywords stated to the retrieval system, though it can contain valuable information for query refinement. We address with this work two research questions related to the aim of developing context aware retrieval systems. Firstly, we show how already available information about the user's context can be employed effectively to gain highly precise search results. Secondly, we investigate how such meta-data about the search context can be gathered. The proposed “query profiles” have a central role in the query refinement process. They automatically detect necessary context information and help the user to explicitly express context dependent search constraints. The effectiveness of the approach is tested with experiments on selected dimensions of the user's context.

When documents are not regarded as a simple sequence of words, but their content is structured in a machine readable form, it is straightforward to develop retrieval systems that make use of the additional structure information. Structured retrieval first asks for the design of a suitable language that enables the user to express queries on content and structure. We investigate here existing query languages, whether and how they support the basic needs of structured querying. However, our main focus lies on the efficiency of structured retrieval systems. Conventional inverted indices for document retrieval systems are not suitable for maintaining structure indices. We identify base operations involved in the execution of structured queries and show how they can be supported by new indices and algorithms on a database system. Efficient query processing has to be concerned with the optimization of query plans as well. We investigate low level query plans of physical database operators for simple query patterns, and demonstrate the benefits of higher level query optimization for complex queries.

New search tasks and interfaces for the presentation of search results, like faceted search applications, question answering, expert search, and automatic timeline construction, come with the need to rank entities, such as persons, organizations or dates, instead of documents or text passages. Modern language processing tools are able to automatically detect and categorize named entities in large text collections. In order to estimate their relevance to a given search topic, we develop retrieval models for entities which are based on the relevance of texts that mention the entity. A graph-based relevance propagation framework is introduced for this purpose that enables to derive the relevance of entities. Several options for the modeling of entity containment graphs and different relevance propagation approaches are tested, demonstrating usefulness of the graph-based ranking framework.

Download Henning's thesis from EPrints.