PF/Tijah – Page 2 – Djoerd Hiemstra

PF/Tijah at INEX

To facilitate topic development for the INEX Entity Ranking track, we developed a simple but effective INEX entity ranking demo. The demo searches in about 4.5 GB of English Wikipedia articles. It is not that fast, but it was coded in less than two days: just insert the data and write a few XQuery statements, done!

Web portal about the Buchenwald camp

On Friday 11 April 2008, the Netherlands Institute for War Documentation (NIOD) published a web portal about the Buchenwald camp, http://www.buchenwald.nl. The portal offers general information about the history of the Buchenwald camp, lets visitors watch the documentary about the camp, and puts 37 interviews (over 60 hours) with Buchenwald survivors online. What is new about this portal is that the interviews can not only be watched in full but can also be searched. The latter was made possible by the Human Media Interaction (HMI) and Databases (DB) groups, both part of the CTIT research institute of the University of Twente. HMI made the spoken words of the survivors accessible using speech technology, and these were made searchable, together with the collection's conventional metadata (descriptions and personal profiles), through the PF/Tijah search engine, co-developed by the UT DB group. As a result, the interviews are accessible online through both the traditional metadata and the literal spoken words of the survivors.

The Buchenwald interview collection consists of 38 interviews with survivors and contains about 60 hours of video in total. A textual rendering of what was literally said during each interview (the so-called speech transcripts) makes it possible to search the spoken words of the survivors of the Buchenwald camp. Because it is known which word was spoken at which moment in which interview, the system can point to the exact place in an interview where the conversation was about, for example, “work in the factories”. This has several advantages for users: they can find out whether certain words were or were not said without listening to the complete interview; they can listen to the retrieved fragments directly without having to play the whole interview; and they can “compute” over the interviews (“how often were certain words used, and by whom”). Although the speech recognizer will regularly make mistakes, certainly on a heritage collection such as the Buchenwald interviews, the recognition output is well suited for searching the interviews. For users of spoken collections, such as documentary makers and researchers, a lot may therefore change with the arrival of search engines like the system described here (see, among others, the PF/Tijah site). With audio and video collections digitized and made accessible over the Internet, it is no longer necessary to visit an archive in person; spoken heritage material can be consulted from one's own desk. Moreover, such collections no longer have to be listened to from beginning to end: by searching the spoken words, the user can request very specific fragments and listen to them directly.
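The idea of knowing which word was spoken at which moment in which interview can be illustrated with a small sketch. This is not how the portal or PF/Tijah is implemented; the transcript rows, interview identifiers, speaker labels, and function names below are all hypothetical and serve only to show why time-stamped transcripts allow both jumping to a fragment and counting word usage.

```python
from collections import Counter, defaultdict

# Hypothetical time-stamped transcript rows:
# (interview id, speaker label, start time in seconds, recognized word)
transcript = [
    ("interview-01", "A", 12.4, "work"),
    ("interview-01", "A", 12.9, "factories"),
    ("interview-02", "B", 301.0, "work"),
    ("interview-02", "B", 655.2, "work"),
]


def build_index(rows):
    """Map each lower-cased word to the (interview, start time) pairs where it occurs."""
    index = defaultdict(list)
    for interview, _speaker, start, word in rows:
        index[word.lower()].append((interview, start))
    return index


def count_by_speaker(rows, word):
    """Count how often each speaker used the given word."""
    wanted = word.lower()
    return Counter(speaker for _i, speaker, _s, w in rows if w.lower() == wanted)


index = build_index(transcript)
positions = index["work"]                      # playback positions to jump to
counts = count_by_speaker(transcript, "work")  # "computing" over the interviews
```

Each entry in `positions` identifies an interview and an offset at which playback could start, which is what lets a user hear a fragment without playing the whole recording.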

The spoken words of the survivors of the Buchenwald camp can be searched at http://www.buchenwald.nl. The search functionality was developed within the projects CHoral, an NWO-CATCH project, and MultimediaN.

Ranked XML Querying Seminar

The goal of the Dagstuhl seminar on Ranked XML Querying is to bring together researchers and practitioners from the database (DB), the information retrieval (IR) and the web/applications communities, and create an environment where the distinct communities collaboratively work on understanding the similarities and differences between their various approaches for querying XML data with heterogeneous structure and content, and benefit from each other's experiences.

The workshop was attended by 27 people from three different research communities: database systems (DB), information retrieval (IR), and Web. The seminar title was interpreted in an IR-style “andish” sense (it also covered subsets of {Ranking, XML, Querying}, with larger sets being favored) rather than in the DB-style strictly conjunctive manner. So in essence, the seminar really addressed the integration of DB and IR technologies, with Web 2.0 being an important target area.

[download report]

Structured Text Retrieval Models

by Djoerd Hiemstra and Ricardo Baeza-Yates

Structured text retrieval models provide a formal definition or mathematical framework for querying semi-structured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked-up text, for instance as XML, where the XML elements define the structure on the text content. Retrieval models for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language. The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called an element, region, or segment, which is defined on top of the text model's word tokens. The query language typically defines a number of operators on content and structure, such as set operators and operators like “containing” and “contained-by”, to model relations between content and structure, as well as relations between the structural elements themselves. Using such a query language, the (expert) user can for instance formulate requests like “I want a paragraph discussing formal models near to a table discussing the differences between databases and information retrieval”. Here, “formal models” and “differences between databases and information retrieval” should match the content that needs to be retrieved from the database, whereas “paragraph” and “table” refer to structural constraints on the units to retrieve. The features, structuring power, and the expressiveness of the query languages of several models for structured text retrieval are discussed in this entry.
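As a rough illustration of the “containing” and “contained-by” operators, here is a minimal sketch that treats regions as inclusive spans of word-token positions. It does not reproduce any particular model from the entry; the `Region` type, the function names, and the toy text are assumptions made for this example only.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str   # structural label such as "paragraph", or a matched term
    start: int  # position of the first word token in the region
    end: int    # position of the last word token in the region (inclusive)


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def containing(candidates, targets):
    """Candidates that contain at least one target region."""
    return [c for c in candidates if any(contains(c, t) for t in targets)]


def contained_by(candidates, targets):
    """Candidates that lie inside at least one target region."""
    return [c for c in candidates if any(contains(t, c) for t in targets)]


def term_regions(tokens, term):
    """Text model: every occurrence of a term is a one-token region."""
    return [Region(term, i, i) for i, tok in enumerate(tokens) if tok == term]


# Toy text model: whitespace tokenization, positions 0..6.
tokens = "formal models differ from tables of data".split()
# Toy structure model: two paragraphs defined over token positions.
paragraphs = [Region("paragraph", 0, 3), Region("paragraph", 4, 6)]

# "paragraphs containing the word 'models'"
hits = containing(paragraphs, term_regions(tokens, "models"))
# "occurrences of 'data' contained by a paragraph"
inside = contained_by(term_regions(tokens, "data"), paragraphs)
```

Because both structural elements and term occurrences are regions over the same token positions, one pair of operators covers relations between content and structure as well as relations between structural elements.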

This entry will soon be published in the Encyclopedia of Database Systems by Springer. The Encyclopedia, under the editorial guidance of Ling Liu and M. Tamer Özsu, will be a multi-volume, comprehensive, and authoritative reference on databases, data management, and database systems. Since it will be available in both print and online formats, researchers, students, and practitioners will benefit from advanced search functionality and convenient interlinking possibilities with related online content. The Encyclopedia's online version will be accessible on the SpringerLink platform. Click here for more information about the Encyclopedia of Database Systems.

[draft]

New PF/Tijah release

With the new stable release of MonetDB/XQuery (version 0.22) comes a new version (version 0.5) of PF/Tijah. We improved the main indexing data structure in this version, which is smaller and more efficient on most queries. Go to the PF/Tijah web site at:
http://dbappl.cs.utwente.nl/pftijah/.

PF/Tijah site views and downloads

The MultimediaN board asks for “Economic impact”, so I gathered some statistics on the usage of PF/Tijah:

Between 1 May 2007 and now, the site was visited 1162 times, with 5127 page views in total.
Between 22 October 2007 (last release) and now, there were 1100 downloads of MonetDB/XQuery 0.20 (including PF/Tijah); 46 of those downloads (4.2%) were referred from the PF/Tijah site.

PF/Tijah documentation available as technical report

by Djoerd Hiemstra, Henning Rode, and Jan Flokstra

PF/Tijah (Pathfinder/Tijah, pronounced “Pee Ef Teeja”) is a flexible open source text search system developed at the University of Twente in cooperation with CWI Amsterdam and TU München. The system is integrated in the Pathfinder XQuery database system and can be downloaded as part of MonetDB/XQuery. This report contains user documentation of PF/Tijah, including example usage in three showcases.

For more information, see: PF/Tijah site

PF/Tijah: text search in an XML database system

by Djoerd Hiemstra, Henning Rode, Roel van Os and Jan Flokstra

This paper introduces the PF/Tijah system, a text search system that is integrated with an XML/XQuery database management system. We present examples of its use, we explain some of the system internals, and discuss plans for future work. PF/Tijah is part of the open source release of MonetDB/XQuery.

[download pdf] [more info]

PFTijah is up and running!

We've run our first NEXI query on the combined system! The NEXI query is compiled from within XQuery, executed, and the results are stored in a BAT. Jan is now working on relating the results back to Pathfinder nodes.

PFTijah Wiki public

PFTijah is the name of an internal project we started within MultimediaN MN5 semantic access. The main goal of the project is to create a flexible environment for setting up search systems by integrating the PathFinder XQuery system with our Tijah XML information retrieval system. Watch the Wiki for system releases and new features.