Being Omnipresent to be Almighty

The Importance of the Global Web Evidence for Organizational Expert Finding

by Pavel Serdyukov and Djoerd Hiemstra

Modern expert finding algorithms are developed under the assumption that all possible expertise evidence for a person is concentrated in the company that currently employs that person. Evidence that can be acquired outside the enterprise is traditionally ignored. At the same time, the Web is full of personal information that is sufficiently detailed to judge a person's skills and knowledge. In this work, we review various sources of expertise evidence outside of an organization and experiment with rankings built on data acquired from six different sources, accessible through the APIs of two major web search engines. We show that these rankings and their combinations are often more realistic and of higher quality than rankings built on organizational data only.

The paper will be presented at the Future Challenges in Expertise Retrieval (fCHER) workshop in Singapore.

[download pdf]

Sound ranking algorithms for XML search

by Djoerd Hiemstra, Stefan Klinger, Henning Rode, Jan Flokstra, and Peter Apers

Ranking algorithms for XML should reflect the actual combined content and structure constraints of queries, while at the same time producing equal rankings for queries that are semantically equal. Ranking algorithms that produce different rankings for queries that are semantically equal are easily detected by tests on large databases: We call such algorithms not sound. We report the behavior of different approaches to ranking content-and-structure queries on pairs of queries for which we expect equal ranking results from the query semantics. We show that most of these approaches are not sound. Of the remaining approaches, only 3 adhere to the W3C XQuery Full-Text standard.

The paper will be presented at the SIGIR 2008 Workshop on Focused Retrieval in Singapore.

[download pdf]

Seminar on Searching and Ranking

Nelly Litvak and I are organizing a small but really interesting seminar before the PhD defense of Henning Rode on 27 June 2008: the first SIKS/Yahoo Seminar on Searching and Ranking in Structured Text Repositories. The goal of the one-day seminar is to bring together researchers from companies and academia working in computer science and applied mathematics on ranking and searching in highly dynamic, structured text environments. Keynote speakers are:

  • Ricardo Baeza-Yates (Yahoo! Research, Barcelona, Spain)
  • Debora Donato (Yahoo! Research, Barcelona, Spain)

The seminar is sponsored by: WGI, CTIT, NWO, SIKS, and Yahoo. Please send your name and affiliation to ssr (at) if you plan to participate in the seminar.

Joost de Wit graduates on evaluating recommender systems

Recommender systems use knowledge about a user’s preferences (and those of others) to recommend items that the user is likely to enjoy. Recommender system evaluation has proven to be challenging, since a recommender system’s performance depends on, and is influenced by, many factors. The data set on which a recommender system operates, for example, has great influence on its performance. Furthermore, the goal for which a system is evaluated may differ and therefore require different evaluation approaches. Another issue is that the quality of a system recorded by the evaluation is only a snapshot in time, since it may change gradually. Although there is no consensus among researchers on which attributes of a recommender system to evaluate, accuracy is by far the most popular dimension to measure. However, some researchers believe that user satisfaction is the most important quality attribute of a recommender and that greater user satisfaction is not achieved by ever-increasing accuracy. Other dimensions for recommender system evaluation that are described in the literature are coverage, confidence, diversity, learning rate, novelty and serendipity. It is believed that these dimensions contribute in some way to the user satisfaction achieved by a recommender system.

Joost performed a user study for which 133 people subscribed to an evaluation application specially designed and built for this purpose. The user study consisted of two phases. During the first phase, users had to rate TV programmes they were familiar with or had recently watched. This phase resulted in 36,353 programme ratings for 7,844 TV programmes. Based on this data, the recommender system that was part of the evaluation application could start generating recommendations. In phase two of the study, the application displayed recommendations for tonight’s TV programmes to its users. These recommendation lists were deliberately varied with respect to the accuracy, diversity, novelty and serendipity dimensions. Another dimension that was altered was programme overlap. Users were asked to provide feedback on how satisfied they were with each list. Over a period of four weeks, 70 users provided 9762 ratings for the recommendation lists. For each of the recommendation lists that were rated in the second phase of the user study, the five dimensions (accuracy, diversity, novelty, serendipity and programme overlap) were measured using 15 different metrics. For each of these metrics, its correlation with user satisfaction was determined using Spearman’s rank correlation. These correlation coefficients indicate whether there is a relation between that metric and user satisfaction, and how strong this relation is. It appeared that accuracy is indeed the most important dimension in relation to user satisfaction. Other metrics that had a strong correlation were user’s diversity, series level diversity, user’s serendipity and effective overlap ratio. This indicates that diversity, serendipity and programme overlap are important dimensions as well, although to a lesser extent.
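To make the correlation step concrete, here is a minimal sketch of Spearman's rank correlation in plain Python. It is not the code used in the study: the function names and the five data points are hypothetical, chosen only to show how one metric's values could be compared with satisfaction ratings.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one metric value and one satisfaction rating per list.
metric_values = [0.2, 0.5, 0.4, 0.9, 0.7]
satisfaction = [1, 2, 3, 5, 4]
rho = spearman(metric_values, satisfaction)
```

A value of `rho` close to 1 or -1 would indicate a strong monotonic relation between the metric and satisfaction; a value near 0 would indicate none.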

[more info] [download pdf]

A Probabilistic Ranking Framework using Unobservable Binary Events for Video Search

by Robin Aly, Djoerd Hiemstra, Arjen de Vries, and Franciska de Jong

This paper concerns the problem of search using the output of concept detectors (also known as high-level features) for video retrieval. Unlike term occurrence in text documents, the event of the occurrence of an audiovisual concept is only indirectly observable. We develop a probabilistic ranking framework for unobservable binary events to search in videos, called PR-FUBE. The framework explicitly models the probability of relevance of a video shot through the presence and absence of concepts. From our framework, we derive a ranking formula and show its relationship to previously proposed formulas. We evaluate our framework against two other retrieval approaches using the TRECVID 2005 and 2007 datasets. Using large numbers of concepts for retrieval, in particular, results in good performance. We attribute the observed robustness against the noise introduced by less related concepts to the effective combination of concept presence and absence in our method. The experiments show that an accurate estimate of the probability of occurrence of a particular concept in relevant shots is crucial to obtain effective retrieval results.

The paper will be presented at the ACM International Conference on Image and Video Retrieval (CIVR 2008) in Niagara Falls, Canada.

[download pdf]

DB Master Students Colloquium

Next Friday 25 April there will be a DB master students colloquium at 13.45 h. in ZI-3126 with two speakers:

  • Alex van Oostrum will talk about: “The design of an object- and aspect oriented framework to facilitate software development of enterprise components”
  • Matthijs Ooms will talk about: “Provenance of Biomedical data”

Web portal about the Buchenwald camp

On Friday 11 April 2008, the Netherlands Institute for War Documentation (Nederlands Instituut voor Oorlogsdocumentatie, NIOD) published a web portal about the Buchenwald camp. The portal offers general information about the history of the camp, lets visitors watch the documentary about the camp, and puts 37 interviews (over 60 hours) with former Buchenwald prisoners online. What is new about this portal is that the interviews can not only be watched in full but can also be searched. The latter was made possible by the Human Media Interaction (HMI) and Databases (DB) groups, both part of the CTIT research institute of the University of Twente. HMI made the spoken words of the survivors accessible using speech technology. Together with the conventional metadata of the collection (descriptions and personal profiles), these were made searchable through the PF/Tijah search engine, co-developed by the UT DB group. As a result, the interviews are accessible online both through the traditional metadata and through the literal spoken words of the survivors.

The Buchenwald interview collection consists of 38 interviews with survivors and contains about 60 hours of video in total. A textual rendering of what was literally said during each interview (the so-called speech transcripts) makes it possible to search the spoken words of the survivors of the Buchenwald camp. Because it is known which word was spoken at which moment in which interview, the system can point to the exact place in an interview where the conversation was about, for example, “work in the factories”. This has several advantages for users: they can find out whether certain words were said or not without listening to the complete interview; they can listen directly to the fragments found without having to play the whole interview; and they can “compute” on the interviews (“how often were certain words used, and by whom”). Although the speech recognizer will regularly make mistakes, certainly for a heritage collection like the Buchenwald interviews, the recognition output can very well be used to search the interviews. For users of spoken collections, such as documentary makers and researchers, much may therefore change with the arrival of search engines like the system described here (see, among others, the PF/Tijah site). Thanks to the digitization and opening up of audio and video collections over the Internet, it is no longer necessary to visit an archive in person; spoken heritage material can be accessed from one's own desk. In addition, such collections no longer need to be listened to from beginning to end: by searching the spoken word, the user can retrieve very specific fragments and listen to them directly.
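The idea of knowing which word was spoken at which moment in which interview can be illustrated with a small time-aligned index. This is only a sketch under stated assumptions, not how PF/Tijah or the portal is implemented: the interview identifiers, timestamps, words and function names below are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical speech-recognizer output: (interview id, start time in seconds, word).
aligned_words = [
    ("interview-01", 12.4, "work"),
    ("interview-01", 13.0, "in"),
    ("interview-01", 13.2, "the"),
    ("interview-01", 13.5, "factories"),
    ("interview-02", 95.1, "factories"),
    ("interview-02", 301.7, "Factories"),
]


def build_index(words):
    """Map each lowercased word to the (interview, time) pairs where it was spoken."""
    index = defaultdict(list)
    for interview, seconds, word in words:
        index[word.lower()].append((interview, seconds))
    return index


def occurrences(index, word):
    """Return every place a word was spoken, so playback can jump straight there."""
    return sorted(index.get(word.lower(), []))


def counts_per_interview(index, word):
    """Count how often a word was used in each interview."""
    return Counter(interview for interview, _ in index.get(word.lower(), []))


index = build_index(aligned_words)
hits = occurrences(index, "factories")
counts = counts_per_interview(index, "factories")
```

Here `hits` answers "where was this said?" with positions a player could seek to, and `counts` is a simple instance of "computing" on the interviews. An empty result for a word means only that the recognizer did not output it, which, given recognition errors, is not proof that it was never spoken.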

The spoken words of the survivors of the Buchenwald camp can be searched via the portal. The search functionality was developed within the projects CHoral, an NWO-CATCH project, and MultimediaN.