Workshop on Heterogeneous Information Access

We are organizing a workshop on Heterogeneous Information Access, hosted by the 8th ACM International Conference on Web Search and Data Mining (WSDM 2015) on 6 February 2015 in Shanghai, China.

Invited speakers: Mounia Lalmas (Yahoo) and Milad Shokouhi (Microsoft Research)

Information access is becoming increasingly heterogeneous. Especially when the user's information need is exploratory, returning a diverse set of results from different resources can benefit the user. For example, when a user is planning a trip to China on the Web, retrieving and presenting results from vertical search engines for travel, flights, maps, and Q&A sites could satisfy the user's rich and diverse information need. This heterogeneous search aggregation paradigm is useful in many contexts and brings many new challenges.

Aggregated search and composite retrieval are two instances of this new heterogeneous information access paradigm, applied on the Web with heterogeneous vertical search engines. The paradigm can be useful in many other scenarios as well: a user re-finding comprehensive information in personal search (emails, slides); a user gathering different nuggets of information (e.g., about an entity) from a set of RDF Web datasets (e.g., DBpedia, IMDB); or a user searching for different types of files (e.g., images, documents) in a peer-to-peer file-sharing system.

This is an emerging area, as the services provided are becoming more heterogeneous and complex, and a number of directions might be interesting for the research and industrial communities. How can we select the most relevant resources and present them concisely in order to best satisfy the user? How can we model the complex user behaviour in this search scenario? How can we evaluate the performance of these systems? These are a few of the key research questions for heterogeneous information access.

The workshop topics of interest are within the context of heterogeneous information access. They include but are not limited to:

  • User modeling for Heterogeneous Information Access, Personalization
  • Metrics, measurements, and test collections
  • Optimization: Resource and vertical selection, Result presentation and diversification
  • Applications: Aggregated/Federated search, Composite retrieval, Structured/Semantic search, P2P search

The workshop includes invited talks by leading researchers in the field from both industry and academia, presentations of contributed submissions, and organized and open discussion on heterogeneous information access.

More information at: http://hia-workshop.com/.

The future of TREC FedWeb

Thanks to everyone for submitting runs to one of the TREC Federated Web Search tasks. We had roughly the same number of participants as last year; not bad, although our goal was to grow. Interestingly, our automatic submission system received an amazing 917 runs.

We discussed the future of the FedWeb track and decided that, as coordinators, we will not propose a FedWeb 2015 track. We were unable to secure funding, and, combined with the fact that we created the FedWeb collection three years in a row (the first time independently of TREC), we believe it is best to finish the track properly this year and not run it again next year. Read more…

Thomas Demeester, Dong Nguyen, Dolf Trieschnigg, Ke Zhou, and Djoerd Hiemstra

Aligning Vertical Collection Relevance with User Intent

by Ke Zhou, Thomas Demeester, Dong Nguyen, Djoerd Hiemstra, and Dolf Trieschnigg

Selecting and aggregating different types of content from multiple vertical search engines is becoming popular in web search. The user vertical intent, the verticals the user expects to be relevant for a particular information need, might not correspond to the vertical collection relevance, the verticals containing the most relevant content. In this work we propose different approaches to define the set of relevant verticals based on document judgments. We correlate the collection-based relevant verticals obtained from these approaches to the real user vertical intent, and show that they can be aligned relatively well. The set of relevant verticals defined by those approaches could therefore serve as an approximate but reliable ground-truth for evaluating vertical selection, avoiding the need for collecting explicit user vertical intent, and vice versa.
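To make the idea of deriving relevant verticals from document judgments concrete, here is a minimal sketch, not the paper's exact method: a vertical counts as collection-relevant when it contains at least a threshold number of judged-relevant documents. The threshold and data layout are illustrative assumptions.

```python
# Illustrative sketch (not the paper's exact approach): derive a set of
# "collection-relevant" verticals from graded document judgments by
# thresholding the number of relevant documents each vertical contains.

from collections import defaultdict

def relevant_verticals(judgments, min_relevant=3):
    """judgments: iterable of (vertical, doc_id, grade) tuples,
    where grade > 0 means the document was judged relevant."""
    counts = defaultdict(int)
    for vertical, _doc, grade in judgments:
        if grade > 0:
            counts[vertical] += 1
    return {v for v, n in counts.items() if n >= min_relevant}

judged = [
    ("travel", "d1", 2), ("travel", "d2", 1), ("travel", "d3", 1),
    ("news", "d4", 0), ("news", "d5", 1),
]
print(relevant_verticals(judged, min_relevant=3))  # {'travel'}
```

The resulting set can then be correlated against the verticals users say they expect, which is the comparison the paper carries out.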

To be presented at the ACM International Conference on Information and Knowledge Management (CIKM 2014) in Shanghai, China, on 3-7 November 2014.

[download pdf]

Evaluate FedWeb runs online

The TREC Federated Web Search track provides a new online tool to check the syntax of your runs and to get preliminary evaluation results on 10 of the 75 provided topics. You can now easily see how your runs compare to other runs submitted to the system. The official TREC evaluation results will be based on at least 50 of the remaining topics. Check your run at:
http://circus.ewi.utwente.nl/fedweb/.

Please note that the site does NOT submit runs to TREC. Submit your runs to TREC via the TREC active participants site before August 18, 2014 (Resource & Vertical Selection), and before September 15, 2014 (Results Merging).

Follow @TRECFedWeb on Twitter.

Yoran Heling graduates on peer selection in Direct Connect

by Yoran Heling

In a distributed peer-to-peer (P2P) system such as Direct Connect, files are often distributed over multiple source peers. It is up to the downloading peer to decide from how many, and from which, source peers to download a particular file. Biased Random Period Switching (BRPS) is an algorithm, implemented at the downloading peer, that determines at what point to download from which source peer. The number of source peers that a downloading peer downloads from at a certain point is called the Degree of Parallelism (DoP). This research focused on implementing BRPS in an existing Direct Connect client and comparing its downloading performance against an unmodified client. Two BRPS implementations were made: a simple implementation that follows the original BRPS algorithm as closely as possible, with only the minor modifications required to ensure that the downloading process does not get stuck on an unavailable source peer; and an improved implementation that incorporates two changes to the original algorithm to ensure that the DoP does not drop below its desired value in the face of unavailable source peers.
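The switching idea can be sketched very roughly in code. The sketch below is a hypothetical illustration of the general mechanism, not the thesis's exact BRPS algorithm: a downloader keeps a fixed number of parallel connections (the DoP) and, at the end of each period, swaps its slowest active source peer for an idle one.

```python
# Rough, hypothetical sketch of the BRPS idea: after each randomly chosen
# period, the downloading peer replaces its slowest active source peer with
# an idle one, keeping the DoP (number of parallel sources) constant.
# Function and parameter names are illustrative assumptions.

import random

def brps_step(active, idle, speeds, rng=random):
    """One switching step: swap the slowest active peer for a random idle one.
    active, idle: lists of peer ids; speeds: dict of peer id -> observed KB/s."""
    if not idle:
        return active, idle  # nothing to switch to
    slowest = min(active, key=lambda p: speeds.get(p, 0.0))
    newcomer = rng.choice(idle)
    active = [p for p in active if p != slowest] + [newcomer]
    idle = [p for p in idle if p != newcomer] + [slowest]
    return active, idle

# One step with DoP = 2: the slow peer "b" is swapped out for "c".
active, idle = brps_step(["a", "b"], ["c"], {"a": 10.0, "b": 1.0, "c": 5.0})
print(active, idle)  # ['a', 'c'] ['b']
```

The improved implementation described above would additionally re-check peer availability, so that unavailable peers never pull the effective DoP below its target.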

The original client and the two BRPS implementations have been evaluated in a controlled Direct Connect network with 50 downloading peers and a variable number of source peers. The source peers were configured to throttle their available bandwidth to an average of 500 KB/s, following a realistic bandwidth distribution based on measurements from the Tor network. In each experiment, all downloading peers downloaded the same file at the same time, and measurements were taken at the downloading peers. Four experiments have been performed, each varying one parameter: the first varied the size of the downloaded file between 100 MB and 1024 MB; the second varied the DoP between 1 and 15; the third varied the number of source peers between 10 and 100; and the last added between 0% and 80% unavailable source peers to the network.

In all experiments, both BRPS implementations performed close to the optimal average download time and were consistently faster than the original client by a factor of 2 to 5. In the last experiment, the improved BRPS implementation kept the measured DoP closer to its desired value than the simple implementation did, but this did not result in a significant difference in the measured download times.

[download pdf]

The Lowlands at TREC

by Robin Aly, Djoerd Hiemstra, Dolf Trieschnigg, and Thomas Demeester

We describe the participation of the Lowlands at the Web Track and the FedWeb track of TREC 2013. For the Web Track we used the MIREX MapReduce library with out-of-the-box approaches. For the FedWeb Track we adapted our shard selection method Taily for resource selection. Our results are above the median performance of TREC participants.

Presented at the 22nd Text REtrieval Conference (TREC) at the National Institute of Standards and Technology (NIST) in Gaithersburg, USA.

[download pdf]

Exploiting User Disagreement for Web Search Evaluation

Exploiting User Disagreement for Web Search Evaluation: An experimental approach

by Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, Dolf Trieschnigg, and Chris Develder

To express a more nuanced notion of relevance as compared to binary judgments, graded relevance levels can be used for the evaluation of search results. Especially in Web search, users strongly prefer top results over less relevant results, and yet they often disagree on which are the top results for a given information need. Whereas previous works have generally considered disagreement as a negative effect, this paper proposes a method to exploit this user disagreement by integrating it into the evaluation procedure. First, we present experiments that investigate the user disagreement. We argue that, with a high disagreement, lower relevance levels might need to be promoted more than in the case where there is global consensus on the top results. This is formalized by introducing the User Disagreement Model, resulting in a weighting of the relevance levels with a probabilistic interpretation. A validity analysis is given, and we explain how to integrate the model with well-established evaluation metrics. Finally, we discuss a specific application of the model, in the estimation of suitable weights for the combined relevance of Web search snippets and pages.
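As a hedged illustration of how disagreement can yield relevance-level weights with a probabilistic interpretation, consider the following sketch. It is not the paper's exact User Disagreement Model: it simply estimates, for each level, the probability that a second assessor assigns the top grade to a document that a first assessor placed at that level, so high disagreement promotes lower levels, as argued above.

```python
# Hedged illustration (not the paper's exact formulation): weight each
# relevance level by the probability that a second assessor gives the top
# grade to a document a first assessor judged at that level.

from collections import defaultdict
from itertools import permutations

def level_weights(judgments, top_grade):
    """judgments: dict of doc_id -> list of grades from different assessors."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for grades in judgments.values():
        for g1, g2 in permutations(grades, 2):  # ordered assessor pairs
            totals[g1] += 1
            if g2 == top_grade:
                hits[g1] += 1
    return {g: hits[g] / totals[g] for g in totals}

# Three documents, two assessors each; disagreement on d2 and d3.
judged = {"d1": [2, 2], "d2": [2, 1], "d3": [1, 0]}
print(level_weights(judged, top_grade=2))
```

With full consensus, only the top level receives weight; as disagreement grows, lower levels gain weight, which is the effect the model formalizes.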

To be presented at the 7th ACM Conference on Web Search and Data Mining (WSDM) in New York City, USA on 24-28 February.

[Read more]

STW Valorization Grant for Q-Able

The University of Twente spin-off Q-Able receives a Valorization Grant Phase 1 from the Dutch Technology Foundation STW to further develop and market its OneBox search technology.

When it comes to web applications, users love the “single text box” interface because it is extremely easy to use. However, much information on the web is stored in structured databases and can only be accessed by filling out a web form with multiple input fields. Examples include planning a trip, booking a hotel room, looking for a second-hand car, etc.

The approach of web search engines – to crawl sites and make a central index of the pages – does not suffice in many cases. First, some sites are hard to crawl because the pages can only be accessed via the web form. Second, some sites provide information that changes quickly, like available hotel rooms, and crawled pages would be almost immediately outdated. Third, some sites provide information that is generated dynamically, like planning a trip from one address to another on a certain date, and it is impossible to crawl all combinations of addresses and dates. Finally, a simple text index that search engines provide does not easily allow structured queries on arbitrary fields. In all these cases, the sites that provide such information can be found using a search engine like Google, but the information itself can only be retrieved after filling in a web form. Filling in one or more web forms with many fields can be a tedious job.

Q-Able replaces a site's web forms with OneBox, a simple text field, giving complex sites the look and feel of Google: a single field for asking questions and performing simple transactions. OneBox allows users to plan a trip by typing, for instance, “Next Wednesday from Enschede to Amsterdam arriving at 9am”, or to search for second-hand cars by typing “Ford C-max 4-doors less than 200,000 kilometres from before 2008”. OneBox can be configured to operate on any web site that provides complex web forms. Furthermore, OneBox can be configured to operate on multiple web sites using a single simple text field. This way, to search for a second-hand car, for instance, users enter a single query and search multiple second-hand car sites with a single click. OneBox only replaces the user interface of a web database: it does not copy, crawl or otherwise index the data itself.

OneBox is the result of the Ph.D. research project of Kien Tjin-Kam-Jet at the University of Twente. His research identified several successful novel approaches to query understanding, combining rule-based approaches with probabilistic approaches that rank query interpretations. Furthermore, the research resulted in an efficient implementation of OneBox that needs only a fraction of a second to interpret queries, even in complex configurations for accessing multiple web databases. Treinplanner.info, Q-Able's first public demonstration of OneBox, demonstrates natural search for the Dutch Railways (Nederlandse Spoorwegen) travel planner, and was well received in user questionnaires, on Twitter, and on Dutch national public radio and television. Q-Able will use the STW valorisation grant to investigate the technical and commercial feasibility of OneBox.
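The combination of rule-based extraction and probabilistic ranking of query interpretations can be sketched in miniature. The rules, fields, and prior scores below are invented for illustration and are far simpler than Q-Able's actual system.

```python
# Minimal, hypothetical sketch of OneBox-style query understanding:
# rules extract candidate interpretations of a free-text travel query,
# and a toy prior score ranks them. All rules and scores are invented.

import re

RULES = [
    (re.compile(r"from (\w+) to (\w+)"), 0.9),  # explicit "from X to Y"
    (re.compile(r"(\w+) to (\w+)"), 0.6),       # weaker implicit pattern
]

def interpret(query):
    """Return candidate interpretations as dicts, best-scoring first."""
    candidates = []
    for pattern, prior in RULES:
        m = pattern.search(query.lower())
        if m:
            candidates.append({"origin": m.group(1),
                               "destination": m.group(2),
                               "score": prior})
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

print(interpret("Next Wednesday from Enschede to Amsterdam"))
```

A real system would also normalize station names against the target site's vocabulary, parse dates and times, and map the winning interpretation onto the underlying web form's fields.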

TREC Federated Web Search track

http://sites.google.com/site/trecfedweb/

First submission due on August 11, 2013

The Federated Web Search track is part of NIST's Text REtrieval Conference TREC 2013. The track investigates techniques for the selection and combination of search results from a large number of real online web search services. The data set, consisting of search results from 157 search engines, is now available. The search engines cover a broad range of categories, including news, books, academic, travel, etc. We have included one big general web search engine, which is based on a combination of existing web search engines.

Federated search is the approach of querying multiple search engines simultaneously, and combining their results into one coherent search engine result page. The goal of the Federated Web Search (FedWeb) track is to evaluate approaches to federated search at very large scale in a realistic setting, by combining the search results of existing web search engines. This year the track focuses on resource selection (selecting the search engines that should be queried), and results merging (combining the results into a single ranked list). You may submit up to 3 runs for each task. All runs will be judged. Upon submission, you will be asked for each run to indicate whether you used result snippets and/or pages, and whether any external data was used. Precise guidelines can be found at:
http://sites.google.com/site/trecfedweb/


Track coordinators

  • Djoerd Hiemstra – University of Twente, The Netherlands
  • Thomas Demeester – Ghent University, Belgium
  • Dolf Trieschnigg – University of Twente, The Netherlands
  • Dong Nguyen – University of Twente, The Netherlands