Federated Search Made Easy

by Dolf Trieschnigg, Kien Tjin-Kam-Jet, and Djoerd Hiemstra

Building a federated search engine based on a large number of existing web search engines is a challenge: implementing the programming interface (API) for each search engine is an exacting and time-consuming job. In this demonstration we present SearchResultFinder, a browser plugin which speeds up determining reusable XPaths for extracting search result items from HTML search result pages. Based on a single search result page, the tool presents a ranked list of candidate extraction XPaths and allows highlighting to view the extraction result. An evaluation with 148 web search engines shows that in 90% of the cases a correct XPath is suggested.
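To make the idea of a reusable extraction XPath concrete, here is a minimal sketch in Python. It is not SearchResultFinder's code and does not reproduce how the tool ranks candidates; the page markup, the `extract_results` helper and the XPath `.//div[@class='result']` are all made up for illustration. It only shows how a single XPath, once found, can pull every result item from a page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page. Real result pages are rarely valid
# XML and would normally need a forgiving HTML parser first.
PAGE = """
<html><body>
  <div id="nav"><a href="/about">About</a></div>
  <div class="result"><a href="http://example.org/1">First hit</a><p>Snippet one</p></div>
  <div class="result"><a href="http://example.org/2">Second hit</a><p>Snippet two</p></div>
</body></html>
"""


def extract_results(page, item_xpath):
    """Apply one item-level XPath and read title, URL and snippet per item."""
    root = ET.fromstring(page)
    items = []
    for node in root.findall(item_xpath):
        link = node.find("a")      # first direct <a> child, assumed to be the title link
        snippet = node.find("p")   # first direct <p> child, assumed to be the snippet
        items.append({
            "title": link.text if link is not None else None,
            "url": link.get("href") if link is not None else None,
            "snippet": snippet.text if snippet is not None else None,
        })
    return items


# The XPath below is an assumed candidate for this made-up page.
results = extract_results(PAGE, ".//div[@class='result']")
for item in results:
    print(item["title"], item["url"])
```

The same XPath can be reused on other result pages of the same engine as long as the page template stays the same, which is what makes it a cheap substitute for a hand-written API wrapper.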

The software can be downloaded as a Firefox plugin.

SIGIR 2013 demonstration

The tool was demonstrated at the ACM SIGIR Conference in Dublin.

[download pdf]

Taily: Shard Selection Using the Tail of Score Distributions

by Robin Aly, Djoerd Hiemstra, and Thomas Demeester

Search engines can improve their efficiency by selecting only a few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query's score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function’s features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.
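The core idea, fitting a Gamma distribution from a mean and a variance and then looking at its tail, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the algorithm from the paper: the method-of-moments fit, the score threshold, the `min_docs` cutoff and the shard statistics are all chosen here for the example, and the paper's own estimation of these quantities from term statistics is not reproduced.

```python
import math


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the series expansion of the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    # Sum z^n / (shape * (shape+1) * ... * (shape+n)) until terms are negligible.
    while term > 1e-12 * total and n < 10000:
        term *= z / (shape + n)
        total += term
        n += 1
    log_prefactor = -z + shape * math.log(z) - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))


def fit_gamma(mean, variance):
    """Method-of-moments fit (an assumption of this sketch): returns (shape, scale)."""
    return mean * mean / variance, variance / mean


def expected_docs_above(threshold, n_docs, mean, variance):
    """Expected number of documents in a shard scoring above the threshold."""
    shape, scale = fit_gamma(mean, variance)
    return n_docs * (1.0 - gamma_cdf(threshold, shape, scale))


# Made-up per-shard statistics for one query: (matching docs, score mean, score variance).
shards = {"A": (1000, 4.0, 4.0), "B": (1000, 2.0, 2.0)}
threshold = 6.0   # hypothetical score a document needs to reach the top ranks
min_docs = 50.0   # hypothetical cutoff: keep shards expected to hold this many such docs

expected = {name: expected_docs_above(threshold, *stats) for name, stats in shards.items()}
selected = [name for name, count in expected.items() if count >= min_docs]
print(expected, selected)
```

Because the inputs are only per-shard means and variances, the selection needs no sampled central index and gives the same answer every time, which mirrors the efficiency and determinism argument made above.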

SIGIR 2013 presentation

Presented at the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in Dublin, Ireland, 28 July – 1 August.

[download pdf]

Study tour completed

We are back from the two-week China tour organized by Inter-Actief: 28 students, 4 cities (Shanghai, Hangzhou, Beijing and Hong Kong), 14 company and university visits in 14 days!

Visit at Tsinghua University

Top 3 university visits: 1) Tsinghua University in Beijing with a very warm welcome by prof. Ling Feng, excellent talks and an impressive campus tour; 2) Jiao Tong University in Shanghai with interesting talks and students demoing their design challenge results; 3) Tongji University, Shanghai with interesting presentations and a campus tour.

Visit at MSR Asia

Top 3 company visits: 1) Microsoft Research Asia in Beijing with an excellent welcome by Tetsuya Sakai, some really awesome tech talks and a cool tour through the lab (see team photo); 2) Philips Design, Hong Kong with interesting talks and some of us participating in an experiment; 3) MotionGlobal, Shanghai, with very inspiring talks and an 'international' office tour. Two runners-up worth mentioning: 4) Nedap in Shanghai, and 5) Alibaba in Hangzhou.

More information at: noodle2012.nl

Bessensap 2012 and the deep web

Djoerd at Bessensap in the Museon

More than 99 percent of the world wide web is currently not searchable by search engines. As a result, much information remains inaccessible. Relatively simple questions such as 'What is the best train journey from Enschede to Amsterdam on 4 June 2012?' and 'What is the phone number of Djoerd Hiemstra from Enschede?' cannot be answered by search engines such as Google and Bing. Yet the answers are available on the web, namely in the deep web, which search engines cannot reach because they have not downloaded its pages in advance. The reasons for this are diverse, and the University of Twente is investigating methods to find this information nonetheless: interpreting questions correctly, routing questions to the right source, and interpreting search results and integrating them with results from other sources. The first demonstration of results from this research (http://treinplanner.info) has already attracted tens of thousands of visitors since the beginning of 2012.

Photo: Jan Taco te Gussinklo. A nice report can be found at: Dutch Button Works.

Study tour to South Korea and China

Noodle is the name of the 2012 study tour organized by study association Inter-Actief from the University of Twente. In September and October 2012 we will visit companies and universities in South Korea and China. Before the students depart they research the countries they will be visiting. All participants conduct research in one of the six research tracks defined within the tour's theme IT Integrated Lifestyle: how IT affects and enriches our daily lives.

Stucie Noodle
The Study Tour Committee: David Huistra, Lex Utama, Marijn Mensinga, Mark Oude Veldhuis, Nils van Kleef, and Yme Joustra

Follow the Noodle study tour preparations at http://noodle2012.nl.

ImagePile: an Alternative for Vertical Results Lists

by Saskia Akkersdijk, Merel Brandon, Hanna Jochmann-Mannak, Djoerd Hiemstra, and Theo Huibers

Recent work shows that children are very well capable of searching with Google, due to their familiarity with the interface. However, children do have difficulties with the vertical list representation of the results. In this paper, we present an alternative result representation for a touch interface, the ImagePile. The ImagePile displays the results as a pile of images through which the user navigates by horizontal swiping. This representation was tested on a search engine for the Emma child hospital's library. Using a within-subject experiment, both representations were tested with children to compare the usability of both systems. The vertical representation was perceived as easier to use, but the ImagePile system was considered more fun to use. Also, with the ImagePile system more relevant results were chosen by the children, and they were more aware of the number of results.

[download pdf]