How many machines do we need to search and manage an index of billions of documents? In this lecture, I will discuss basic techniques for indexing very large document collections. I will cover inverted files, index compression, and top-k query optimization techniques, showing that a single desktop PC suffices for searching billions of documents. An important part of the lecture will be spent on estimating index sizes and processing times. By the end of the afternoon, students will have a better understanding of the scale of the web and its consequences for building large-scale web search engines, and they will be able to implement a cheap but powerful new 'Google'.
To be presented at the SIKS Course Advances in Information Retrieval on 18 and 19 June in Vught, The Netherlands.