Language models, smoothing and n-grams

A language model assigns a probability to a piece of unseen text, based on some training data. For example, a language model based on a big English newspaper archive is expected to assign a higher probability to “a bit of text” than to “aw pit tov tags”, because the words in the former phrase (or word pairs or word triples if so-called n-gram models are used) occur more frequently in the data than the words in the latter phrase. For information retrieval, typical usage is to build a language model for each document. At search time, the top ranked document is the one which’ language model assigns the highest probability to the query.

The term language models originates from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980’s. Speech recognition systems use a language model to complement the results of the acoustic model which models the relation between words (or parts of words called phonemes) and the acoustic signal. The history of language models, however, goes back to beginning of the 20th century when Andrei Markov used language models (Markov models) to model letter sequences in works of Russian literature. Another famous application of language models are Claude Shannon’s models of letter sequences and word sequences, which he used to illustrate the implications of coding and information theory. In the 1990’s language models were applied as a general tool for several natural language processing applications, such as part-of-speech tagging, machine translation, and optical character recognition. Language models were applied to information retrieval by a number of research groups in the late 1990’s. They became rapidly popular in information retrieval research. By 2001, the ACM SIGIR conference had two separate sessions on language models containing 5 papers in total. In 2003, a group of leading information retrieval researchers published a research road map “Challenges in Information Retrieval and Language Modeling”, indicating that the future of information retrieval and the future of language modeling can not be seen apart from each other.

This entry will soon be published in the Encyclopedia of Database Systems by Springer. The Encyclopedia, under the editorial guidance of Ling Liu and M. Tamer Özsu, will be a multiple volume, comprehensive, and authoritative reference on databases, data management, and database systems. Since it will be available in both print and online formats, researchers, students, and practitioners will benefit from advanced search functionality and convenient interlinking possibilities with related online content. The Encyclopedia's online version will be accessible on the SpringerLink platform. Click here for more information about the Encyclopedia of Database Systems.

[draft]