Language models, smoothing and n-grams

A language model assigns a probability to a piece of unseen text, based on some training data. For example, a language model based on a big English newspaper archive is expected to assign a higher probability to “a bit of text” than to “aw pit tov tags”, because the words in the former phrase (or word pairs or word triples if so-called n-gram models are used) occur more frequently in the data than the words in the latter phrase. For information retrieval, typical usage is to build a language model for each document. At search time, the top ranked document is the one which’ language model assigns the highest probability to the query.

The term language models originates from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980’s. Speech recognition systems use a language model to complement the results of the acoustic model which models the relation between words (or parts of words called phonemes) and the acoustic signal. The history of language models, however, goes back to beginning of the 20th century when Andrei Markov used language models (Markov models) to model letter sequences in works of Russian literature. Another famous application of language models are Claude Shannon’s models of letter sequences and word sequences, which he used to illustrate the implications of coding and information theory. In the 1990’s language models were applied as a general tool for several natural language processing applications, such as part-of-speech tagging, machine translation, and optical character recognition. Language models were applied to information retrieval by a number of research groups in the late 1990’s. They became rapidly popular in information retrieval research. By 2001, the ACM SIGIR conference had two separate sessions on language models containing 5 papers in total. In 2003, a group of leading information retrieval researchers published a research road map “Challenges in Information Retrieval and Language Modeling”, indicating that the future of information retrieval and the future of language modeling can not be seen apart from each other.

This entry will soon be published in the Encyclopedia of Database Systems by Springer. The Encyclopedia, under the editorial guidance of Ling Liu and M. Tamer Özsu, will be a multiple volume, comprehensive, and authoritative reference on databases, data management, and database systems. Since it will be available in both print and online formats, researchers, students, and practitioners will benefit from advanced search functionality and convenient interlinking possibilities with related online content. The Encyclopedia's online version will be accessible on the SpringerLink platform. Click here for more information about the Encyclopedia of Database Systems.

[draft]

Improved query difficulty prediction for the web

by Claudia Hauff, Vanessa Murdock, Ricardo Baeza-Yates

Query performance prediction aims to predict whether a query will have a high average precision given retrieval from a particular collection, or low average precision. An accurate estimator of the quality of search engine results can allow the search engine to decide to which queries to apply query expansion, for which queries to suggest alternative search terms, to adjust the sponsored results, or to return results from specialized collections. In this paper we present an evaluation of state of the art query prediction algorithms, both post-retrieval and pre-retrieval and we analyze their sensitivity towards the retrieval algorithm. We evaluate query difficulty predictors over three widely different collections and query sets and present an analysis of why prediction algorithms perform significantly worse onWeb data. Finally we introduce Improved Clarity, and demonstrate that it outperforms state-of-the-art predictors on three standard collections, including two large Web collections.

The paper will be presented at the ACM Conference on Information and Knowledge Management CIKM 2008 in Napa Valley, USA

[download pdf]

An Image Retrieval Interface for Children

by Sander Bockting, Matthijs Ooms, Djoerd Hiemstra, Paul van der Vet, and Theo Huibers

Studies on information retrieval for children are not yet common. As young children possess a limited vocabulary and limited intellectual power, they may experience more difficulty in fulfilling their information need than adults. This paper presents an image retrieval user interface that is specifically designed for children. The interface uses relevance feedback and has been evaluated by letting children perform different search tasks. The tasks were performed using two interfaces; a more traditional interface – acting as a control interface – and the relevance feedback interface. One of the remarkable results of this study is that children did not favor relevance feedback controls over traditional navigational controls.

[download pdf]

Robert Zwerus graduates on storing “PIM” data

Storing Personal Information Management (PIM) data is not trivial, because of the variety in content types. Existing PIM storage systems have shortcomings in performance, data concistency and/or concurrency. In this thesis, we propose several optimisations and test them in Akonadi, KDE’s new central PIM data access manager. The optimisations include using the D-Bus protocol for transmitting short commands and notifications and an IMAP-compatible protocol for data access and modification. The PIM data is kept in its native format, but compressed and split up into separate, frequently-used parts for increased performance. Both the synthetic and use case based evaluation results show that the proposed modifications perform well and help maintain data consistency in Akonadi.

Read more on E-prints

Robin Zagers graduates on a decision system for Google Adwords

Advertising within search engines, such as Google, is a growing market. Sponsored hyperlinks appear besides the results where one searches for. This makes goal-oriented advertising possible. For many keywords there are several advertisers, this results in a growing competition. When a company wants to advertise on several keywords, it is difficult to manage this well. Gladior manages the advertisement campaigns for different companies. By the quantity of companies it is almost no longer possible to manage all advertisements by hand. Therefore there is a need for Gladior for an automated system for managing adwords campaigns.

For this purpose a bid management system has been developed. With this system it is possible to specify the requirements for a specific advertisement: a minimum number of visitors a week, a stable position in the list of advertisers, or a stable average price per click. These requirements determine the boundaries for the price, so that the price can be raised or reduced depending on of the behavior of the competition at that moment. For each campaign a list of advertisement words (Adwords) is collected. For each Adword the performance is requested. Based on the performance, the decision is made whether it is necessary to adjust the current settings. When an adjustment is necessary the new price is calculated and will be uploaded. In following paragraphs a closer explanation of the design of the bidmanagementsystem is described. The Google API makes it possible to down- and upload Adwords data. The requested data is then stored in a local database, after which a connection with other systems is possible. Several applications keep this information up-to-date, this makes it possible for the bidmanagementsystem to further process the data.

Processing the data starts with combining the Adwords-data with the requirements of the Adwords professional; the requirements the advertisement must fulfil. The requirements must be formulated in terms of the available indicators. This list of indicators can be extended. This information is stored in the database with management information. For each Adword the requirements are stored, they can be requested and modified at any moment. The required data from the database is selected using this management information. The selected data is used to make a prediction for the next period. This prediction is calculated using the single moving average algorithm. To prove that a different prediction algorithm has a better result then the single moving average algorithm, more data is necessary than is available at this moment. The new price can be calculated in several manners, the most simple method is raising and reducing the price with one cent. If more advertisers bid the same amount for the same Adword, this method has a large influence. Within one or two updates the desired situation is returned. With less advertisers, the difference between the Adword positions is more than a couple of cents. In this case reaching the aimed situation needs more updates. For evaluating the prototype, the working method of the Adwords professional is converted to an automated process. The evaluation shows that the bid management system realizes, for the managed Adwords campaigns, a decrease of the distance to the indicated goal. The collected Adwords- and visitor information makes it possible to calculate optimal settings per Adword. This can be used to react quicker and more accurate towards other Adwords.

More information on ePrints.

Bram Smulders graduates on real time data distribution and storage

The techniques used nowadays in a lot of research centers processing lots of sensor data, for instance wind tunnels, are often based on traditional relational databases. In some cases, no databases are used at all. No care is taken to deliver data under real time constraints. This does not have a negative effect in case data is merely stored for analysis after the measurement process has completed, but it can have disastrous effects when the data is needed for control of critical elements in the measurement process itself.

This report discusses a possible solution to real time delivery storage of data. It does not focus on sensor data. Instead, it should be flexible enough to suit any situation in which it is desirable to distribute data under real time constraints. The newly created solution, carrying the name “SQLbusRT”, is based on the blackboard architecture pattern, which will be explained in this report. A comparison is made on how the architecture of the new solution matches with the blackboard architecture. The choice for the blackboard pattern is mainly for its flexibility in the addition and removal of components to and from the system. System components will be able to work on a shared storage. This shared storage is called the blackboard, giving the name to the architecture pattern. A prototype is developed by combining readily available open source products and creating new interfaces. The open source products which are used in this project are MySQL and ORTE. MySQL is a database management system which is known for its high performance and is used on a large scale worldwide. ORTE is an implementation of the RTPS protocol, which serves as a data communication channel over Ethernet, using a publish subscribe mechanism. An explanation of ORTE and the publish subscribe mechanism is given in this report. This report discusses some tests which were executed to predict the performance, reliability and scalability of SQLbusRT in a simple setup. This set of tests can be extended in future research when SQLbusRT matures.

More info on e-Prints, and on Bram's blog

Evaluation of Multimedia Retrieval Systems

by Djoerd Hiemstra and Wessel Kraaij

In this chapter, we provide the tools and methodology for comparing the effectiveness of two or more multimedia retrieval systems in a meaningful way. Several aspects of multimedia retrieval systems can be evaluated without consulting the potential users or customers of the system, such as the query processing time (measured for instance in milliseconds per query) or the query throughput (measured for instance as the number of queries per second). In this chapter, however, we will focus on aspects of the system that influence the effectiveness of the retrieved results. In order to measure the effectiveness of search results, one must at some point consult the potential user of the system. For, what are the correct results for the query “black jaguar”? Cars, or cats? Ultimately, the user has to decide.

Download author version.

Question Answering for Dutch: Simple does it

by Arjen Hoekstra, Djoerd Hiemstra, Paul van der Vet and Theo Huibers

When people pose questions in natural language to search for information on the web, the role of question answering (QA) systems becomes important. In this paper the QAsystem simpleQA, capable of answering Dutch questions on which the answer is a person or a location, is described. The system’s algorithm does not use a lot of complex NLP-techniques, but instead uses the magnitude of and redundancy on the World Wide Web to its advantage. The system has been evaluated on the DISEQuA corpus and performed quite well: MRR near 0.5. For further improvements it can easily be extended by adding more rewrite rules and applying more sophisticated filtering and tiling.

[download pdf]

Joint HMI and DB colloquium, Friday 21 July, 14:00 h.-14:45 h. in ZI-2126

The Potential of User Feedback through the Iterative Refining of Queries in an Image Retrieval System

by Maher Ben Moussa and Marco Pasch

Inaccurate or ambiguous expressions in queries lead to poor results in information retrieval. We assume that iterative user feedback can improve the quality of queries. To this end we developed a system for image retrieval that utilizes user feedback to refine the user's search query. This is done by a graphical user interface that returns categories of images and requires the user to choose between them in order to improve the initial query in terms of accuracy and unambiguousness. A user test showed that, although there was no improvement in search time or required search restarts, iterative user feedback can indeed improve the performance of an image retrieval system in terms of user satisfaction.

Maher and Marco are M.Sc. students at the University of Twente. They will present their work at the fourth International Workshop on Adaptive Multimedia Retrieval (AMR) in Geneva, Switzerland.

[download pdf]

Using Language Models for Information Retrieval

Because of the world wide web, information retrieval systems are now used by millions of untrained users all over the world. The search engines that perform the information retrieval tasks, often retrieve thousands of potentially interesting documents to a query. The documents should be ranked in decreasing order of relevance in order to be useful to the user. This book describes a mathematical model of information retrieval based on the use of statistical language models. The approach uses simple document-based unigram models to compute for each document the probability that it generates the query. This probability is used to rank the documents. The study makes the following research contributions.

  • The development of a model that integrates term weighting, relevance feedback and structured queries.
  • The development of a model that supports multiple representations of a request or information need by integrating a statistical translation model.
  • The development of a model that supports multiple representations of a document, for instance by allowing proximity searches or searches for terms from a particular record field (e.g. a search for terms from the title).
  • A mathematical interpretation of stop word removal and stemming.
  • A mathematical interpretation of operators for mandatory terms, wildcards and synonyms.
  • A practical comparison of a language model-based retrieval system with similar systems that are based on well-established models and term weighting algorithms in a controlled experiment.
  • The application of the model to cross-language information retrieval and adaptive information filtering, and the evaluation of two prototype systems in a controlled experiment.

Experimental results on three standard tasks show that the language model-based algorithms work as well as, or better than, today’s top-performing retrieval algorithms. The standard tasks investigated are ad-hoc retrieval (when there are no previously retrieved documents to guide the search), retrospective relevance weighting (find the optimum model for a given set of relevant documents), and ad-hoc retrieval using manually formulated Boolean queries. The application to cross-language retrieval and adaptive filtering shows the practical use of respectively structured queries, and relevance feedback.

[download pdf]