by Almer Tigelaar and Djoerd Hiemstra
Many servers on the web offer content that is only accessible via a search interface. These are part of the deep web. Using conventional crawling to index the content of these remote servers is impossible without some form of cooperation. Query-based sampling provides an alternative to crawling requiring no cooperation beyond a basic search interface. In this approach, conventionally, random queries are sent to a server to obtain a sample of documents of the underlying collection. The sample represents the entire server content. This representation is called a resource description. In this research we explore if better resource descriptions can be obtained by using alternative query construction strategies. The results indicate that randomly choosing queries from the vocabulary of sampled documents is indeed a good strategy. However, we show that, when sampling a large collection, using the least frequent terms in the sample yields a better resource description than using randomly chosen terms.
Learning to Merge Search Results for Efficient Distributed Information Retrieval
Kien Tjin-Kam-Jet and Djoerd Hiemstra
Merging search results from different servers is a major problem in Distributed Information Retrieval. We used Regression-SVM and Ranking-SVM which learn a function that merges results based on information that is readily available, i.e. the ranks, titles, summaries and URLs contained in the results pages. By not downloading additional information, such as the full document, we decrease bandwidth usage. CORI and Round Robin merging were used as our baselines; surprisingly, our results show that the SVM methods do not improve over those baselines
You can find the home work assignments for XML & Databases 1 in the “Assignments” section on Blackboard.
Deadline: Friday, 2 April 2010.
Keyword Suggestion for Search Engine Marketing
by Ralf Schimmel
Every person acquainted with the web, is also a frequent user of search engines like Yahoo and Google. Any person with a web site makes this web site with a vision in mind, most of the times this entails being found on the web. Search engines offer several methods to users that help them to be found. One group of the techniques used in this field is Search Engine Optimization (SEO), which covers everything that can be done to optimize a web site for the search engine. The whole idea of SEO is to ensure that a web site is listed in the set of search results once a matching query is entered by a user. A second important part of the search engines is Search Engine Advertisement (SEA). Billions of dollars are paid by companies that bid on keywords that match their advertisements to a users query. These keywords are hard to find, of course a company knows what it sells, but it does not know how the users search for the same products or services. Advertising in search engines can be done in multiple ways. The focus of this research lies in finding many long-tail keywords, words that often have a low search volume, but which are cheap (low competition) and which are often specific enough to ensure high conversion rates (a visitor becomes a customer). Several keyword suggestion techniques are researched and evaluated for practical use. One applicable technique is chosen, implemented and evaluated. The chosen technique is a web based technique which is using an undirected weighted graph of candidate terms (nodes), where the weight of the vertices is the semantic similarity between the two nodes, and where the term frequency of the term is stored in the node. The evaluation shows that it is a technique capable of suggesting a lot of relevant keywords that can be used for search engine marketing. According to the evaluation the technique is capable of using the term frequencies and the semantic similarities to find and rank suggestions based on popularity and relevance. The most important conclusion is that, for single term suggestions, the system outperforms Google's suggestion system. Google's precision on single term suggestions is better then the precision of the new tool, however the relative recall of Google is a lot worse, for both obvious and non-obvious single term suggestions. Currently the tool can only be used to complement Google's tool, however once extended with support for multi term suggestions it can replace the entire system.
Searching for Broadcast News Items Using Language Models of Concepts
by Robin Aly, Aiden Doherty, Djoerd Hiemstra, and Alan Smeaton
Current video search systems commonly return video shots as results. We believe that users may better relate to longer, semantic video units and propose a retrieval framework for news story items, which consist of multiple shots. The framework is divided into two parts: (1) A concept based language model which ranks news items with known occurrences of semantic concepts by the probability that an important concept is produced from the concept distribution of the news item and (2) a probabilistic model of the uncertain presence, or risk, of these concepts. In this paper we use a method to evaluate the performance of story retrieval, based on the TRECVID shot-based retrieval groundtruth. Our experiments on the TRECVID 2005 collection show a significant performance improvement against four standard methods.
The paper will be presented at the 32nd European Conference on Information Retrieval (ECIR) in Milton Keynes, UK. (and in the DB colloquium of 24 March)
The newest stable release of MonetDB/PF/Tijah is now on-line, version 0.13.0 as part of Pathfinder 0.36.1
More info on the PF/Tijah site.
Expanding the usability of recorded lectures: A new age in teaching and classroom instruction
by Erwin de Moel
The status of recorded lectures at Delft University of Technology has been studied in order to expand its usability in their present and future educational environment. Possibilities for the production of single file vodcasts have been tested. These videos allow for an increased accessibility of their recorded lectures through the form of other distribution platforms. Furthermore the production of subtitles has been studied. This was done with an ASR system called SHoUT, developed at University of Twente, and machine translation of subtitles into other languages. SHoUT generated transcripts always require post-processing for subtitling. Machine translation could produce translated subtitles of sufficient quality. Navigation of recorded lectures needs to be improved, requiring input of the lecturer. Collected metadata from lecture chapter titles, slide data (titles, content and notes) as well as ASR results have been used for the creation of a lecture search engine, which also produces interactive tables of content and tag clouds for each lecture. Recorded lectures could further be enhanced with time-based discussion boards, for the asking and answering of questions. Further improvements have been proposed for allowing recorded lectures to be re-used in recurring online-based courses.
You now officially have a 50% chance of getting a job at Google. 😉
Google hired about half the students who took Bisciglia's first class.
Read the Register's article on MapReduce.
Data-Intensive Text Processing with MapReduce
An interesting book of by Jimmy Lin and Chris Dyer is forthcoming, in which they show how MapReduce can be used to solve large-scale text processing problems, including examples that use Expectation Maximization training.
This book is about MapReduce algorithm design, particularly for text processing applications. Although our presentation most closely follows implementations in the Hadoop open-source implementation of MapReduce, this book is explicitly not about Hadoop programming. We don't for example, discuss APIs, driver programs for composing jobs, command-line invocations for running jobs, etc.
See pre-prints of the book.
Final grades for the course are out. You find them on Blackboard's personal grade center.