by Sebastiaan Vercammen
Distributed search introduces a tension between resources, which need time to process queries and produce results, and users, who are waiting for an answer to their query. The system could wait a maximum amount of time for every resource to produce its results, or start displaying results the very moment they are retrieved by the distributed search engine. This thesis introduces a number of alternative display strategies and describes a method to research their effectiveness in providing the most relevant results, as quickly and as high in the combined results as possible, while maintaining a user-friendly search experience. It then describes the performed research and its results. For each experiment, test participants are asked a number of questions to describe their experience operating the search engine with the specific display strategy. Statistics concerning the test participants’ clicks are also recorded. These metrics are combined with the answers to the user questions and used for determining the best display strategy. Observations were made of aspects that seemed to have influenced the experiment, such as the red color of the notifications used for one of the display strategies. The precise influence of these aspects should be studied further, using A/B testing, as proposed in section 7.2. Finally, the conclusion is drawn that the Screen fill with “endless” scrolling display strategy (section 3.3.4) performed best when taking the test participants’ answers into account.
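The core trade-off between the two baseline strategies mentioned above (wait for everything versus display on arrival) can be illustrated with a small sketch. The resource names, latencies, and the simple arrival-order merge below are invented for illustration and are not taken from the thesis.

```python
# Illustrative sketch of two display strategies for distributed search:
# wait for all resources versus display results as they arrive.
# Resource names, latencies (in seconds), and documents are hypothetical.

def wait_for_all(resources):
    """Show nothing until the slowest resource has answered."""
    first_paint = max(latency for latency, _ in resources.values())
    merged = [doc for _, docs in sorted(resources.values()) for doc in docs]
    return first_paint, merged

def incremental(resources):
    """Show results the moment each resource answers, in arrival order."""
    first_paint = min(latency for latency, _ in resources.values())
    merged = []
    for latency, docs in sorted(resources.values()):
        merged.extend(docs)
    return first_paint, merged

resources = {
    "web":    (0.2, ["w1", "w2"]),
    "images": (1.5, ["i1"]),
    "news":   (0.6, ["n1", "n2"]),
}

print(wait_for_all(resources))  # user waits 1.5s before seeing anything
print(incremental(resources))   # first results already after 0.2s
```

The final result lists are identical; the strategies differ only in when the user first sees something, which is exactly the dimension the user experiments measure.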
Federated Aggregated Search
by Andrés Marenco Zúñiga
The traditional search engine paradigm has changed from retrieving simple text documents to selecting a broader combination of diverse document types (e.g. images, videos, maps) that could satisfy the user’s information need. Each type of document is stored in specialized databases known as ‘verticals’, found in either local or federated locations, and nowadays integrated into ‘aggregated search engines’. Because each vertical covers a specific domain, when a query enters the system only the verticals most likely to contain the desired information should be selected. To perform this selection, a text representation of each vertical is created by directly sampling a set of documents from the vertical’s search engine. However, this representation is often not descriptive enough: the heterogeneous nature of the documents or the lack of cooperation of the vertical can negatively affect its generation. We therefore focus on the problem of creating an aggregated search engine that integrates federated collections in an uncooperative environment. With the help of Wikipedia as a complementary external source of information, we investigate three techniques from the literature aimed at enriching the vertical representation: a) using only Wikipedia articles as the representation; b) using a combination of Wikipedia articles and the sample obtained from the vertical; and c) expanding the contents of each sampled document. We show that by applying latent Dirichlet allocation to model the hidden topics of documents directly sampled from each vertical, it is possible to identify Wikipedia articles with the same theme coverage as the vertical. We then demonstrate that for some particular verticals, using only Wikipedia articles as the representation improves the selection task.
As a second point, we explored the use of the modelled topics together with Wikipedia categories to boost the scores of the verticals that could be associated with the query string. Although our results in this case are inconclusive, the experiments suggest that by applying query classification and then matching the obtained categories with the verticals' categories, it is possible to increase the effectiveness of the vertical selection task.
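A minimal sketch of the matching step described above, assuming the topic distributions for a vertical and for candidate Wikipedia articles have already been produced (e.g. by LDA). All names and numbers below are invented for illustration.

```python
# Hypothetical sketch: once LDA has produced a topic distribution for a
# vertical's sampled documents and for candidate Wikipedia articles,
# articles with similar topic coverage can be ranked by cosine similarity.
import math

def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

vertical_topics = [0.7, 0.2, 0.1]  # e.g. an (invented) travel vertical

wikipedia = {
    "Tourism":          [0.60, 0.30, 0.10],
    "Particle_physics": [0.05, 0.05, 0.90],
}

# Rank candidate articles by how well their topics match the vertical.
ranked = sorted(wikipedia,
                key=lambda a: cosine(vertical_topics, wikipedia[a]),
                reverse=True)
print(ranked[0])  # the article whose topic coverage best matches the vertical
```

The top-ranked articles would then form (part of) the vertical's enriched text representation.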
by Thomas Demeester, Dolf Trieschnigg, Dong Nguyen, Ke Zhou, and Djoerd Hiemstra
The TREC Federated Web Search track facilitates research in topics related to federated web search by providing a large, realistic data collection sampled from a multitude of online search engines. The FedWeb 2013 tasks of Resource Selection and Results Merging are again included in FedWeb 2014, and we additionally introduced the task of Vertical Selection. Other new aspects are the required link between Resource Selection and Results Merging, and the importance of diversity in the merged results. After an overview of the new data collection and relevance judgments, the individual participants' results for the tasks are introduced, analyzed, and compared.
Presented at the 23rd Text Retrieval Conference (TREC) in Gaithersburg, USA
Scaling Learning to Rank to Big Data: Using MapReduce to Parallelise Learning to Rank
by Niek Tax
Learning to rank is an increasingly important task within the scientific fields of machine learning and information retrieval that comprises the use of machine learning for ranking. New learning to rank methods are generally evaluated in terms of ranking accuracy on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by the absence of a standard set of evaluation benchmark collections. Furthermore, little research has been done on the scalability of the training procedures of learning to rank methods, to prepare us for input data sets that keep growing larger. This thesis concerns both the comparison of learning to rank methods using a sparse set of evaluation results on benchmark data sets, and the speed-up that can be achieved by parallelising learning to rank methods using MapReduce.
In the first part of this thesis we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: 1) Normalized Winning Number, which gives insight into the ranking accuracy of the learning to rank method, and 2) Ideal Winning Number, which gives insight into the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets were collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF were found to be the best performing learning to rank methods, in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number. Of these ranking algorithms, FenchelRank and FSMRank are pairwise ranking algorithms and the others are listwise ranking algorithms.
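The two winning numbers can be computed from a sparse score table along the following lines; the scores below are invented, and the exact definitions in the thesis may differ in detail.

```python
# Sketch of (Normalized) Winning Number on a sparse evaluation table.
# scores[(method, dataset)] = ranking accuracy; all numbers are invented.
# A method's Winning Number counts the pairwise comparisons it wins;
# its Ideal Winning Number counts the comparisons available to it at all.
scores = {
    ("ListNet", "MQ2007"): 0.50, ("ListNet", "OHSUMED"): 0.45,
    ("RankSVM", "MQ2007"): 0.47,
    ("LRUF",    "MQ2007"): 0.52, ("LRUF",    "OHSUMED"): 0.48,
}
methods = {m for m, _ in scores}
datasets = {d for _, d in scores}

def winning_numbers(method):
    """Return (Winning Number, Ideal Winning Number) for one method."""
    wn = iwn = 0
    for d in datasets:
        if (method, d) not in scores:
            continue  # no evaluation result: comparison unavailable
        for other in methods - {method}:
            if (other, d) in scores:
                iwn += 1
                if scores[(method, d)] > scores[(other, d)]:
                    wn += 1
    return wn, iwn

for m in sorted(methods):
    wn, iwn = winning_numbers(m)
    print(m, wn / iwn)  # Normalized Winning Number
```

A method evaluated on few datasets has a low Ideal Winning Number, so its Normalized Winning Number rests on little evidence; this is exactly the certainty aspect the second component captures.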
In the second part of this thesis we analyse the speed-up of the ListNet training algorithm when implemented in the MapReduce computing model. We found that running ListNet on MapReduce comes with a job scheduling overhead in the range of 150-200 seconds per training iteration. This makes MapReduce very inefficient for processing small data sets with ListNet compared to a single-machine implementation of the algorithm. The MapReduce implementation of ListNet was, however, found to offer improvements in processing time for data sets that are larger than the physical memory of the single machine otherwise available for computation. In addition, we showed that ListNet tends to converge faster when a normalisation preprocessing procedure is applied to the input data. The training time of our cluster version of ListNet was found to grow linearly with data size. This shows that the cluster implementation of ListNet can be used to scale the ListNet training procedure to arbitrarily large data sets, provided that enough data nodes are available for computation.
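The map/reduce split of ListNet training can be sketched as follows, for a linear model and the top-one cross-entropy loss; the partitioning, feature vectors, and learning rate are illustrative and not the thesis implementation.

```python
# Sketch of the map/reduce split for ListNet's gradient: each map task
# computes the gradient over its partition of queries; the reduce step
# sums the partial gradients before a single weight update.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def query_gradient(w, docs):
    """ListNet top-one cross-entropy gradient for one query (linear model).

    docs is a list of (feature_vector, relevance_label) pairs."""
    feats = [f for f, _ in docs]
    model_scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in feats]
    p_model = softmax(model_scores)
    p_label = softmax([y for _, y in docs])
    grad = [0.0] * len(w)
    for f, pm, pl in zip(feats, p_model, p_label):
        for i, xi in enumerate(f):
            grad[i] += (pm - pl) * xi  # dL/dw = X^T (P_model - P_label)
    return grad

def map_task(w, partition):
    """One partial gradient per partition of queries."""
    g = [0.0] * len(w)
    for docs in partition:
        g = [a + b for a, b in zip(g, query_gradient(w, docs))]
    return g

def reduce_task(partials):
    """Sum the partial gradients from all map tasks."""
    return [sum(col) for col in zip(*partials)]

w = [0.0, 0.0]
partitions = [
    [[([1.0, 0.0], 2), ([0.0, 1.0], 0)]],  # partition 1: one query, two docs
    [[([0.5, 0.5], 1), ([1.0, 1.0], 0)]],  # partition 2: one query, two docs
]
grad = reduce_task([map_task(w, p) for p in partitions])
w = [wi - 0.1 * gi for wi, gi in zip(w, grad)]  # one gradient-descent step
```

Because each iteration is one such MapReduce job, the per-job scheduling overhead mentioned above is paid once per training iteration, which explains why small data sets suffer disproportionately.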
ACM SIGIR is the major international forum for the presentation of new research results and for the demonstration of new systems and techniques for Information Retrieval (IR). SIGIR 2015 will be held in Santiago, Chile on 9-13 August 2015. The Conference and Program Chairs invite all those working in areas related to IR to submit original proposals for demonstrations.
Demonstrations present first-hand experience with research prototypes or operational systems. They provide opportunities to exchange ideas gained from implementing IR systems, and to obtain feedback from expert users. Accepted demonstration submissions will appear in the conference proceedings.
What makes a good demo?
A good demonstration submission is interesting to a SIGIR audience and shows a novel solution to a problem. The demonstration submission should address the following questions: What problem does my system solve? Who is my target user? What does my demonstration do and how does it work? How does it compare with existing systems? Finally, how and when will my technology have an impact? Demonstration submissions are welcome in any of the areas related to aspects of Information Retrieval (IR), as identified in the call for papers on the SIGIR website.
- Submissions deadline (tentative): 18 February 2015.
- Acceptance notifications (tentative): 20 April 2015.
Read the full Call for Demonstrations
Scientific and economic progress is increasingly powered by our capabilities to explore big datasets. Data is the driving force behind the successful innovation of Internet companies like Google, Twitter, and Yahoo, and job advertisements show an increasing need for data scientists and big data analysts. Data scientists dig for value in data by analyzing for instance texts, application usage logs, and sensor data. The need for data scientists and big data analysts is apparent in almost every sector in our society, including business, health care, and education.
The Twente Center for Data Science is a collaboration between research groups of the University of Twente to research, promote and facilitate big data analysis for all scientific disciplines. The center operates by the participants sharing their expertise, sharing their contacts, sharing their data, and sharing their research infrastructure (hardware and software) for large-scale data analysis.
The Twente Data Science Center offers a unique combination of expertise in computer science, mathematics, management, behavioral sciences and social sciences; collaborations with leading international companies such as Google, Twitter and Yahoo; and local infrastructure and support for the analysis of very large datasets.
We organize a workshop on Heterogeneous Information Access, hosted by the 8th International Conference on Web Search and Data Mining on 6 February 2015 in Shanghai, China.
Invited speakers: Mounia Lalmas (Yahoo) and Milad Shokouhi (Microsoft Research)
Information access is becoming increasingly heterogeneous. Especially when the user's information need is exploratory, returning a set of diverse results from different resources could benefit the user. For example, when a user is planning a trip to China on the Web, retrieving and presenting results from vertical search engines like travel, flight information, maps, and Q&A sites could satisfy the user's rich and diverse information need. This heterogeneous search aggregation paradigm is useful in many contexts and brings many new challenges.
Aggregated search and composite retrieval are two instances of this new heterogeneous information access paradigm. They are applied on the Web with heterogeneous vertical search engines. This paradigm can be useful in many other scenarios: a user aims to re-find comprehensive information about his query in his personal search (emails, slides); a user searches and gathers different nuggets of information (e.g. an entity) from a set of RDF Web datasets (e.g. DBpedia, IMDB); or a user searches a set of different files (e.g. images, documents) in a peer-to-peer online file sharing system.
This is an emerging area, as the services provided are becoming more heterogeneous and complex. There are therefore a number of directions that might be interesting for the research and industrial community. How to select the most relevant resources and present them concisely in order to best satisfy the user? How to model the complex user behaviour in this search scenario? How can we evaluate the performance of these systems? These are a few of the key research questions to study for heterogeneous information access.
The workshop topics of interest are within the context of heterogeneous information access. They include but are not limited to:
- User modeling for Heterogeneous Information Access, Personalization
- Metrics, measurements, and test collections
- Optimization: Resource and vertical selection, Result presentation and diversification
- Applications: Aggregated/Federated search, Composite retrieval, Structured/Semantic search, P2P search
The workshop includes invited talks by leading researchers in the field from both industry and academia, presentations by contributed submissions as well as organized and open discussion on heterogeneous information access.
More information at: http://hia-workshop.com/.
Thanks everyone for submitting runs to one of the TREC Federated Web Search tasks. We had roughly the same number of participants as last year; not bad, although our goal was to grow. Interestingly, our automatic submission system received an amazing 917 runs.
We discussed the future of the FedWeb track, and we decided that we will not propose a FedWeb 2015 track as coordinators, because we were unable to secure funding. Combined with the fact that we created the FedWeb collection three years in a row (although the first time independently of TREC), we believe it is best to finish the track properly this year and not run it again next year. Read more…
Thomas Demeester, Dong Nguyen, Dolf Trieschnigg, Ke Zhou, and Djoerd Hiemstra
by Ke Zhou, Thomas Demeester, Dong Nguyen, Djoerd Hiemstra, and Dolf Trieschnigg
Selecting and aggregating different types of content from multiple vertical search engines is becoming popular in web search. The user vertical intent, the verticals the user expects to be relevant for a particular information need, might not correspond to the vertical collection relevance, the verticals containing the most relevant content. In this work we propose different approaches to define the set of relevant verticals based on document judgments. We correlate the collection-based relevant verticals obtained from these approaches to the real user vertical intent, and show that they can be aligned relatively well. The set of relevant verticals defined by those approaches could therefore serve as an approximate but reliable ground-truth for evaluating vertical selection, avoiding the need for collecting explicit user vertical intent, and vice versa.
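One possible shape of such a document-judgment-based definition, sketched with invented judgments and a simple Jaccard agreement measure; the paper's actual approaches and correlation measures may differ.

```python
# Hypothetical sketch: call a vertical "relevant" for a query when it has
# enough judged relevant documents, then compare that collection-based set
# with the verticals users said they expected. All data is invented.

def relevant_verticals(judgments, threshold=1):
    """Verticals with at least `threshold` judged relevant documents."""
    return {v for v, rel_docs in judgments.items() if len(rel_docs) >= threshold}

def agreement(collection_based, user_intent):
    """Jaccard overlap between the two vertical sets."""
    if not collection_based and not user_intent:
        return 1.0
    union = collection_based | user_intent
    return len(collection_based & user_intent) / len(union)

judgments = {"news": ["d1", "d2"], "images": ["d5"], "shopping": []}
user_intent = {"news", "images"}  # what the user said they expected

inferred = relevant_verticals(judgments)
print(inferred, agreement(inferred, user_intent))
```

A high agreement across many queries is what would justify using the judgment-based sets as a ground truth for vertical selection in place of explicit user intent.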
To be presented at the ACM International Conference on Information and Knowledge Management (CIKM 2014) in Shanghai, China on 3-7 November 2014
DesignLab is a university-wide facility for teaching and research, where the unique profile of the University of Twente flourishes. High Tech, Human Touch: Technology for Society. DesignLab is a place for innovation, creativity and inspiration, where talented students and researchers come to seek and drive application of their work, and where industry, government, NGOs and SMEs explore options to tackle the challenges they face. Researchers bring in promising new knowledge and technologies, while industrial and societal partners bring in real-world challenges. Together, with joint energy and creativity, new solutions and new possibilities will be explored.
More information at: http://www.utwente.nl/designlab/.