Alexandru Serban graduates on Personalized Ranking in Academic Search

Context Based Personalized Ranking in Academic Search

by Alexandru Serban

A common criticism of search engines is that they return the same results to users who issue exactly the same query but have distinct information needs. Personalized search is considered a solution: search results are re-evaluated based on user preferences or activity. Instead of relying on the unrealistic assumption that people will precisely specify their intent when searching, the user profile is exploited to re-rank the results. This thesis focuses on two problems related to academic information retrieval systems. The first part is dedicated to data sets for search engine evaluation. A test collection consists of documents, a set of information needs (also called topics), queries that represent the requests sent to the information retrieval tool, and relevance judgements for the top documents retrieved from the collection. Relevance judgements are difficult to gather because the process involves manual work. We propose an automatic method to generate queries from the content of a scientific article and to evaluate the relevance of the results. A test collection is generated, but its power to discriminate between relevant and non-relevant results is limited.

In the second part of the thesis, Scopus performance is improved through personalization. We focus on the academic background of researchers who interact with Scopus, since information about their academic profile is already available. Two methods for personalized search are investigated.
First, the connections between academic entities, expressed as a graph structure, are used to evaluate how relevant a result is to the user. We use SimRank, a similarity measure for entities based on their relationships with other entities. Second, the semantic structure of documents is exploited to evaluate how meaningful a document is to the user. A topic model is trained to reflect the user's interests in research areas and to score how relevant the search results are.
In the end, both methods are merged with the initial Scopus ranking. The results of a user study show a consistent performance increase for the first 10 results.
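The SimRank intuition mentioned above is that two entities are similar when their neighbours are similar. A minimal iterative sketch on a toy author-paper graph (the graph, node names, and parameter values are illustrative, not taken from the thesis):

```python
from itertools import product

def simrank(graph, C=0.8, iters=10):
    """Iterative SimRank: two nodes are similar if their neighbours
    are similar. `graph` maps each node to its set of neighbours."""
    nodes = list(graph)
    # Start with identity similarity: s(a, a) = 1, s(a, b) = 0 otherwise.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iters):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            na, nb = graph[a], graph[b]
            if not na or not nb:
                new[(a, b)] = 0.0
                continue
            # Average similarity over all neighbour pairs, damped by C.
            total = sum(sim[(x, y)] for x in na for y in nb)
            new[(a, b)] = C * total / (len(na) * len(nb))
        sim = new
    return sim

# Toy undirected graph: two authors linked to the papers they wrote.
g = {
    "author1": {"paper1", "paper2"},
    "author2": {"paper2", "paper3"},
    "paper1": {"author1"},
    "paper2": {"author1", "author2"},
    "paper3": {"author2"},
}
scores = simrank(g)
print(round(scores[("author1", "author2")], 3))
```

Here the two authors receive a nonzero similarity because they share a paper, even though they are not directly connected; in the thesis setting such scores can then be combined with the original ranking.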

[download pdf]

Bas Niesink graduates on biomedical information retrieval

Improving biomedical information retrieval with pseudo and explicit relevance feedback

by Bas Niesink

The HERO project aims to increase the quality of supervised exercise during cancer treatment by making use of a clinical decision support system. In this research, concept-based information retrieval techniques to find relevant medical publications for such a system were developed and tested. These techniques were designed to search multiple document collections, without the need to store copies of the collections.
The influence of pseudo and explicit relevance feedback using the Rocchio algorithm was explored. The underlying retrieval models tested are TF-IDF and BM25.
The tests were conducted using the TREC Clinical Decision Support datasets from the 2014 and 2015 editions. The TREC CDS relevance judgements were used to simulate explicit feedback. The NLM Medical Text Indexer was used to extract MeSH terms from the TREC CDS topics, enabling concept-based queries. Furthermore, the difference in performance between inverse document frequencies calculated on the entire PMC dataset and those calculated on a collection of several thousand intermediate search results was measured.
The results show that both pseudo and explicit relevance feedback have a strong positive influence on the inferred NDCG. Additionally, the performance difference when using IDF values calculated on a very small document collection is limited.
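The Rocchio algorithm used for the feedback step moves the query vector toward relevant documents and away from non-relevant ones. A minimal sketch over term-weight dictionaries (the alpha/beta/gamma values and the toy documents are illustrative defaults, not the settings used in this research):

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback on term-weight dictionaries:
    q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant)."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)
    new_query = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        non = sum(d.get(t, 0.0) for d in non_relevant) / max(len(non_relevant), 1)
        w = alpha * query.get(t, 0.0) + beta * rel - gamma * non
        new_query[t] = max(w, 0.0)  # negative weights are usually clipped to 0
    return new_query

# Toy example: feedback pulls in "chemotherapy", pushes down "gym".
q = {"cancer": 1.0, "exercise": 1.0}
rel_docs = [{"cancer": 0.8, "chemotherapy": 0.6}]
non_docs = [{"exercise": 0.9, "gym": 0.7}]
expanded = rocchio(q, rel_docs, non_docs)
print(expanded)
```

With pseudo relevance feedback the top-ranked results of an initial search play the role of `rel_docs`; with explicit feedback, as simulated here via the TREC CDS judgements, real relevance labels are used instead.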

[download pdf]

Term Extraction paper in Computing Reviews’ Best of 2016

The paper Evaluation and analysis of term scoring methods for term extraction, co-authored with Suzan Verberne, Maya Sappelli, and Wessel Kraaij, has been selected as one of ACM Computing Reviews' 2016 Best of Computing. Computing Reviews is published by the Association for Computing Machinery (ACM); its editor-in-chief is Carol Hutchins (New York University).

In the paper, we evaluate five term scoring methods for automatic term extraction on four different types of text collections. We show that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
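A common flavour of unsupervised term scoring compares how often a candidate term occurs in the target collection with how often it occurs in a general background corpus. The sketch below uses a pointwise KL-style score as one example of this idea (it is an illustration of the general approach, not a reimplementation of any of the five methods evaluated in the paper):

```python
import math
from collections import Counter

def term_scores(target_texts, background_texts):
    """Score candidate terms by pointwise KL: p_t * log(p_t / p_b), where
    p_t and p_b are the term's probabilities in the target and background
    collections. Terms frequent in the target but rare in the background
    score highest."""
    tgt = Counter(w for text in target_texts for w in text.lower().split())
    bg = Counter(w for text in background_texts for w in text.lower().split())
    n_tgt, n_bg = sum(tgt.values()), sum(bg.values())
    scores = {}
    for term, freq in tgt.items():
        p_t = freq / n_tgt
        p_b = (bg.get(term, 0) + 1) / (n_bg + len(bg))  # add-one smoothing
        scores[term] = p_t * math.log(p_t / p_b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy collections: domain-specific text vs. general background text.
domain = ["term extraction scores candidate terms", "term scoring methods"]
general = ["the cat sat on the mat", "a general text about many things"]
ranking = term_scores(domain, general)
print(ranking[:3])
```

Real term extraction pipelines additionally restrict candidates to noun phrases or n-grams rather than single whitespace tokens; the scoring step, however, has this general shape.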

[download pdf]

SIGIR Test of Time Awardees 1978-2001

Overview of Special Issue

by Donna Harman, Diane Kelly (Editors), James Allan, Nicholas J. Belkin, Paul Bennett, Jamie Callan, Charles Clarke, Fernando Diaz, Susan Dumais, Nicola Ferro, Donna Harman, Djoerd Hiemstra, Ian Ruthven, Tetsuya Sakai, Mark D. Smucker, Justin Zobel (Authors)

This special issue of SIGIR Forum marks the 40th anniversary of the ACM SIGIR Conference by showcasing papers selected for the ACM SIGIR Test of Time Award from the years 1978-2001. These papers document the history and evolution of IR research and practice, and illustrate the intellectual impact the SIGIR Conference has had over time.
The ACM SIGIR Test of Time Award recognizes conference papers that have had a long-lasting influence on information retrieval research. When the award guidelines were created, eligible papers were identified as those published in a window of time 10 to 12 years prior to the year of the award. This meant that in the first year the award was given, 2014, eligible papers came from the years 2002-2004. To identify papers published during the period 1978-2001 that might also be recognized with the Test of Time Award, a committee was created, led by Keith van Rijsbergen. Members of the committee were: Nicholas Belkin, Charlie Clarke, Susan Dumais, Norbert Fuhr, Donna Harman, Diane Kelly, Stephen Robertson, Stefan Rueger, Ian Ruthven, Tetsuya Sakai, Mark Sanderson, Ryen White, and Chengxiang Zhai.
The committee used citation counts and other techniques to build a nomination pool. Nominations were also solicited from the community. In addition, a sub-committee was formed of people active in the 1980s to identify papers from the period 1978-1989 that should be recognized with the award. As a result of these processes, a nomination pool of papers was created and each paper in the pool was reviewed by a team of three committee members and assigned a grade. The 30 papers with the highest grades were selected to be recognized with an award.
To commemorate the 1978-2001 ACM SIGIR Test of Time awardees, we invited a number of people from the SIGIR community to contribute write-ups of each paper. Each write-up consists of a summary of the paper, a description of the main contributions of the paper and commentary on why the paper is still useful. This special issue contains reprints of all the papers, with the exception of a few whose copyrights are not held by ACM (members of ACM can access these papers at the ACM Digital Library as part of the original conference proceedings).
As members of the selection committee, we really enjoyed reading the older papers. The style was very different from today's SIGIR papers: the writing was simple and unpretentious, with an equal mix of creativity, rigor, and openness. We encourage everyone to read at least a handful of these papers and to consider how things have changed, and if, and how, we might bring some of the positive qualities of these older papers back to the SIGIR program.

To be published in SIGIR Forum 51(2), Association for Computing Machinery, July 2017

[download pdf]

Exploring the Query Halo Effect in Site Search

Leading People to Longer Queries

by Djoerd Hiemstra, Claudia Hauff, and Leif Azzopardi

People tend to type short queries; however, the belief is that longer queries are more effective. Consequently, a number of attempts have been made to encourage and motivate people to enter longer queries. While most have failed, a recent attempt, conducted in a laboratory setup, in which the query box has a halo or glow effect that changes as the query becomes longer, has been shown to increase query length by one term on average. In this paper, we test whether a similar increase is observed when the same component is deployed in a production system for site search and used by real end users. To this end, we conducted two separate experiments in which the rate at which the color of the halo changes was varied. In both experiments users were assigned to one of two conditions: halo and no-halo. The experiments were run over a fifty-day period with 3,506 unique users submitting over six thousand queries. In both experiments, however, we observed no significant difference in query length. We also did not find longer queries to result in greater retrieval performance. While we did not reproduce the previous findings, our results indicate that the query halo effect appears to be sensitive to performance and task, limiting its applicability to other contexts.

To be presented at SIGIR 2017, the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval in Tokyo, Japan on August 7-11, 2017

Also to be presented at DIR2017, the 16th Dutch-Belgian Information Retrieval Workshop in Hilversum, The Netherlands, on November 24, 2017

[download pdf]

Greetings from CuriousU Search Engine Technology

CuriousU Search Engine Technology will explore the world of search engines. You will learn how search engines work, what challenges they deal with, and how their performance can be measured. And even better: you will be guided in building, evaluating, and improving your own search engine on a real-world dataset.

To be presented at CuriousU Summer School 2017, 13-22 August 2017, at the University of Twente.

Private search in the browser

Even our smartphones are now powerful enough to search serious-sized document collections, such as personal blogs, sites with software documentation, sites of small and medium-sized enterprises, and even the famous Cranfield collection. In-browser search comes with interesting privacy benefits.

Read more at the Searsia Blog.

Slavica Zivanovic graduates on capturing and mapping QoL using Twitter data

by Slavica Zivanovic

There is an ongoing discussion about the applicability of social media data in scientific research. Moreover, little is known about the feasibility of using these data to capture Quality of Life (QoL). This study explores the use of social media in QoL research by capturing and analysing people's perceptions of their QoL in Twitter messages. The methodology is based on a mixed-method approach, combining manual coding of the messages, automated classification, and spatial analysis. The city of Bristol is used as a case study, with a dataset containing 1,374,706 geotagged tweets sent within the city boundaries in 2013. Based on the manual coding results, the health, transport, and environment domains were selected for further analysis. Results show differences between Bristol wards in the number and type of QoL perceptions in every domain, the spatial distribution of positive and negative perceptions, and differences between the domains. Furthermore, results from this study are compared to the official QoL survey results from Bristol, both statistically and spatially. Overall, three main conclusions are drawn. First, Twitter data can be used to evaluate QoL. Second, based on people's opinions, there is a difference in QoL between Bristol neighbourhoods. Third, Twitter messages can be used to complement QoL surveys, but not as a proxy. The main contribution of this study lies in recognising the potential of Twitter data in QoL research: producing additional knowledge about QoL that can be placed in a planning context and effectively used to improve decision-making and enhance residents' quality of life.

[download pdf]