by Djoerd Hiemstra
Query autocompletions help users of search engines to speed up their searches by recommending completions of partially typed queries in a drop down box. These recommended query autocompletions are usually based on large logs of queries that were previously entered by the search engine’s users. Therefore, misinformation entered — either accidentally or purposely to manipulate the search engine — might end up in the search engine’s recommendations, potentially harming organizations, individuals, and groups of people. This paper proposes an alternative approach for generating query autocompletions by extracting anchor texts from a large web crawl, without the need to use query logs. Our evaluation shows that even though query log autocompletions perform better for shorter queries, anchor text autocompletions outperform query log autocompletions for queries of 2 words or more.
To be presented at the 2nd International Symposium on Open Search Technology (OSSYM 2020), 12-14 October 2020, CERN, Geneva, Switzerland.
[download pdf] [slides]
The past months Searsia investigated ways for search engines to provide search advertisements without participating in the large advertisement networks of Google and Facebook, and more importantly, without the need for search engines to track their users.
Leading People to Longer Queries
by Djoerd Hiemstra, Claudia Hauff, and Leif Azzopardi
People tend to type short queries, however, the belief is that longer queries are more effective. Consequently, a number of attempts have been made to encourage and motivate people to enter longer queries. While most have failed, a recent attempt — conducted in a laboratory setup — in which the query box has a halo or glow effect, that changes as the query becomes longer, has been shown to increase query length by one term, on average. In this paper, we test whether a similar increase is observed when the same component is deployed in a production system for site search and used by real end users. To this end, we conducted two separate experiments, where the rate at which the color changes in the halo were varied. In both experiments users were assigned to one of two conditions: halo and no-halo. The experiments were ran over a fifty day period with 3,506 unique users submitting over six thousand queries. In both experiments, however, we observed no significant difference in query length. We also did not find longer queries to result in greater retrieval performance. While, we did not reproduce the previous findings, our results indicate that the query halo effect appears to be sensitive to performance and task, limiting its applicability to other contexts.
To be presented at SIGIR 2017, the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval in Tokyo, Japan on August 7-11, 2017
Also to be presented at DIR2017, the 16th Dutch-Belgian Information Retrieval Workshop in Hilversum, The Netherlands, on November 24, 2017
Even our smart phones are now powerful enough to search serious-sized document collections, such as personal blogs, sites with software documentation, sites of small and medium-sized enterprises, and even the famous Cranfield collection. In-browser search comes with interesting privacy benefits.
Read more at the Searsia Blog.
Does Google doubt whether the holocaust happened?
Query autocompletion algorithms that are based on query logs are problematic in two important ways: 1) They return offensive and damaging results; 2) They suffer from a destructive feedback loop.
The Dutch chapter of the Internet Society (ISOC) nominated Searsia for its 2017 Innovation Award.
Read more on the Searsia blog
The slides of the CLEF keynote can be downloaded below
A case for search specialization and search delegation
Evaluation conferences like CLEF, TREC and NTCIR are important for the field, and keep being important because there is no “one-size-fits-all” for search engines. Different domains need different ranking approaches: For instance, Web search benefits from analyzing the link graph; Twitter search benefits from retweets and likes; Restaurant search benefits from geo-location and reviews; Advertisement search need bids and click-through, etc. Researching many domains will learn us more about the need and the value of the specialization of search engines, and about approaches that can quickly learn rankings for new domains using for instance learning-to-rank and clever feature selection.
A search engine that provides results from multiple domains, therefore better delegates its queries to specialized search engines. This brings up unique research questions on how to best select a specialized search engine. The TREC Federated Web Search track, that ran in 2013 and 2014, studied these questions in two tasks: the resource selection task studied how to select, given a query but before seeing the results for the query, the top specialized search engines for a query. The vertical selection task studied how to select the top domains from a predefined set of domains such as news, video, Q&A, etc.
I will present the lessons that we learned from running the Federated Web Search track, focusing on successful approaches to resource selection and vertical selection. I will conclude the talk by discussing our steps to take this work to full practice by running the University of Twente's search engine as a federation of more than 30 smaller search engines, including local databases with news, courses, publications, as well as results from social media like Twitter and YouTube. The engine that runs U. Twente search is called Searsia and is available as open source software at: http://searsia.org.
As of this today, the university is using our Distributed Search approach as their main search engine on: http://utwente.nl/search (and also stand-alone on https://search.utwente.nl). The UT search engine offers its user not only the results from a large web crawl, but also live results from many sources that were previously invisible, such as courses, timetables, staff contact information, publications, the local photo database “Beeldbank”, vacancies, etc. The search engine combines about 30 of such sources, and learns over time which sources should be included for a query, even if it has never seen that query, nor the results for the query.
Read more in the official announcement (in Dutch).