Alisa Rieger defends PhD thesis on responsible opinion formation

Striving for responsible opinion formation in web search on debated topics

by Alisa Rieger

Web search plays an important role in the contemporary information landscape, shaping individual and collective knowledge by providing fast and effortless access to vast amounts of resources. We rely on web search engines for various information needs, some of which can carry serious consequences. This is particularly evident when searching for information on debated topics, which can shape opinions and practical decisions. Debated topics are characterized by diverse and often opposing perspectives linked to different values and interests. Ideally, individuals would diligently engage with different perspectives to become well-informed and form opinions responsibly. However, engaging with information on debated topics can be cognitively demanding and subject to emotionally charged and biased behavior. When resorting to web search to find information on debated topics, searchers may be confronted with further obstacles. For instance, search engines are known to apply opaque ranking criteria, may not provide sufficient viewpoint diversity, and might foster over-reliance.

In this dissertation, we present different user studies aimed at better understanding the challenges of web search on debated topics and identifying measures to help searchers overcome these challenges. We first explored whether and how factors inherent to the searcher and search interface affect search behavior. Then, we investigated the risks and benefits of interventions to guide search behavior as well as empower searchers, aiming at supporting unbiased and diligent search interactions without restricting searcher autonomy. Our findings underscore the unique characteristics of web search on debated topics and provide a foundation for designing, tailoring, and evaluating interventions to support searchers. Considering the overall insights gained through our user studies, it becomes clear that the most pivotal challenges of web search on debated topics arise from the complex searcher-system interplay. Rather than turning to simple fixes, there is a need to acknowledge the complexity of the issue and commit to comprehensive investigations and solutions to avoid inadvertently exacerbating risks. Laying the groundwork for future investigations, we provide an extensive review of interdisciplinary literature with a detailed account of challenges and research opportunities.

With this dissertation, we raise awareness of the pressing socio-technical issues related to digital media and opinion formation and aspire to encourage interdisciplinary research teams, practitioners, and policymakers to join forces in establishing web search environments that foster individual and societal well-being.

[more information]

Zaheer Babar defends PhD thesis on radiology report generation systems

Evaluating the impact of Radiology Reports Structure on AI-Powered Radiology Report Generation Systems

by Zaheer Babar

Radiology reports play an essential role in diagnosing and monitoring various diseases and conditions, from pneumonia to lung cancer and bone conditions. The ability to convey findings clearly and comprehensively is paramount, and producing well-structured, clear, and clinically well-focused radiology reports is essential for high-quality patient diagnosis and care. High-quality patient diagnosis and care can be achieved using a computer-aided radiology report system, which assists radiologists in producing well-structured, clear, and clinically well-focused radiology reports. Deep learning has made significant strides in image caption generation, but it has remained a highly challenging task in the medical domain.
One main challenge is understanding and linking complicated medical observations detected in given images with accurate natural language descriptions. Radiologists follow a standard way of writing these reports, describing a fixed set of diseases and conditions and indicating whether each is normal or abnormal. As a result, medical reports usually overlap with each other due to the common content of anatomy. This standardized way of reporting makes it challenging for a machine learning model to capture the prominent problems and abnormalities indicated in radiology reports. This impact can be felt across various aspects of the task, ranging from the utilization of validation metrics to the performance of the model and the use of different components within it. In this thesis, we study this impact on different levels and demonstrate that our research will lead to reliable progress in automatic radiology report generation.

[more information]

Nirmal Roy defends PhD thesis on the effects of interfaces on search

Exploring the effects of interactive interfaces on user search behaviour

by Nirmal Roy

Interactive information retrieval (IIR) is a user-centered approach to information seeking and retrieval. In this paradigm, the search process is not confined to a single query and a static set of results. Instead, it emphasises the active involvement of users in refining their information needs, iteratively modifying queries, and exploring retrieved content. IIR studies investigate how to facilitate a more tailored and practical search experience, adapting to the evolving requirements and preferences of users. In this thesis, we focus on four distinct yet interrelated areas in the domain of IIR to better understand the interaction between the user and the information retrieval system.

[Read more]

Semere Bitew defends PhD thesis on Language Models for Education

Language Model Adaptation with Applications in AI for Education

by Semere Kiros Bitew

The overall theme of my dissertation is adapting language models, mainly for applications in AI in education, to automatically create educational content. It addresses the challenges in formulating test and exercise questions in educational settings, which traditionally require significant training, experience, time, and resources. This is particularly critical in high-stakes environments like certifications and tests, where questions cannot be reused. In particular, the primary research is focused on two educational tasks: distractor generation and gap-filling exercise generation. The distractor generation task refers to generating plausible but incorrect answers in multiple-choice questions, while gap-filling exercise generation refers to inducing well-chosen gaps to generate grammar exercises from existing texts. These tasks, although extensively researched, present unexplored avenues that recent advancements in language models can address. As a secondary objective, I explore the adaptation of coreference resolution to new languages. Coreference resolution is a key NLP task that involves clustering mentions in a text that refer to the same real-world entities, a process vital for understanding and generating coherent language.
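To make the gap-filling task concrete, here is a minimal sketch of how an exercise could be produced once the gap positions are known. It is only an illustration: the function name `make_gap_exercise`, the whitespace tokenisation, and the hand-picked gap indices are assumptions, and choosing good gaps (the part the dissertation addresses with language models) is not modelled here.

```python
def make_gap_exercise(text, gap_indices, blank="____"):
    """Replace the whitespace-separated tokens at gap_indices with blanks.

    Returns the exercise text and the list of removed tokens (the answer key).
    Hypothetical helper: gap positions are supplied by hand, not predicted.
    """
    tokens = text.split()
    # Ignore indices that fall outside the sentence.
    gaps = {i for i in gap_indices if 0 <= i < len(tokens)}
    exercise = [blank if i in gaps else tok for i, tok in enumerate(tokens)]
    answers = [tokens[i] for i in sorted(gaps)]
    return " ".join(exercise), answers


# Example sentence (made up for illustration): blank out the verb phrase.
exercise, answers = make_gap_exercise("She has lived in Delft since 2019", [1, 2])
print(exercise)
print(answers)
```

In a realistic setting the gap indices would come from a model that scores which tokens make pedagogically useful gaps, rather than from a fixed list.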

Read more

Felipe Moraes Gomes defends PhD thesis on Collaborative Search

Examining the Effectiveness of Collaborative Search Engines

by Felipe Moraes Gomes

Although searching is often seen as a solitary activity, searching in collaboration with others is deemed useful or necessary in many complex situations such as: travel planning; online shopping; looking for health-related information; planning birthday parties; working on a group project; or finding a house to buy. Researchers have found that complex search tasks can be executed more effectively and efficiently, achieve higher material coverage, and enable higher knowledge gains in an explicit collaborative setting than when conducted in isolation. However, even though researchers have carefully designed several Collaborative Search (CSE) user studies, there is still conflicting evidence or a lack of evidence on the effectiveness of CSE systems. Thus, in this thesis, we focus on examining the effectiveness of CSE systems in two parts.

In the first part, we shed light on the effectiveness of CSE to support two group configurations, namely group sizes and users’ roles. Past collaborative search studies have had a strong focus on groups of two or three collaborators, which naturally limits the number of experimental conditions, as these could otherwise increase quickly. Therefore, there is a lack of evidence suggesting the extent to which a CSE system can support group sizes beyond these commonly investigated group sizes. Thus, in Chapter 3, we study CSE system effectiveness with group size as the primary independent variable. Here, we vary group sizes from two to six collaborators, with six as our upper bound due to limitations on our available resources.

In Chapter 4, we focus on roles in CSE. Roles can determine how a group splits up the search task and determine each group member’s function (e.g., one group member is responsible for finding documents and reading and evaluating them, with a further member responsible for in-depth reading and evaluating of the aforementioned documents). In particular, when the CSE system assigns a role to each group member, researchers have hypothesised that a group may reduce the time spent communicating and coordinating the task, and make the search process more efficient and successful than groups without role assignment. However, past user studies have provided contradicting evidence as to the utility of assigned roles in CSE. Thus, in Chapter 4, we provide more evidence to settle the question of the effectiveness of CSE systems when used by groups with pre-assigned roles versus groups without pre-assigned roles.

In the second part of this thesis, we hold our group configurations constant: group sizes are set to at most three people, and all group members receive the same role. We then turn to a different perspective and focus on examining the effectiveness in two contexts: Search as Learning (SAL) and collaborative online shopping. Search activities for human learning involve multiple iterations that require cognitive processing and interpretation, often requiring the searcher to spend time scanning/viewing, comparing, and evaluating information. However, web search engines are not built to support users in the search tasks often required in learning situations. When people use search as a learning activity, it can be an individual activity or a collaborative activity (e.g., group projects). Hence, in Chapter 5, we tackle the challenge of identifying the impact of web search engines on the (single-search or collaborative search) users’ ability to learn compared to learning acquired via high-quality learning materials as a baseline.

In Chapter 6, we look at a further context: collaborative online shopping. In collaborative online shopping, a group of people come together to make a decision to purchase a product that meets the various group members’ requirements and opinions. While shopping together, search is an important part of the task, since the group has to find products in the catalogue available on an e-commerce website. One important aspect of collaborative shopping is supporting awareness and sharing of knowledge, as it can enable a sense of co-presence, which helps groups make a decision that satisfies each group member’s requirements and wishes. As search is a significant part of a collaborative online shopping experience, CSE systems are suitable for executing such tasks. However, there is insufficient evidence of how well CSE systems can support a group of users in searching for online products together and making a group decision. Hence, in Chapter 6, we explore the effects of increased awareness and sharing of knowledge (co-presence) using a CSE system in collaborative shopping on the group decision making process.

[more info]

Chang Li defends PhD thesis on Optimizing Ranking Systems Online as Bandits

Optimizing Ranking Systems Online as Bandits

by Chang Li

People use interactive systems, such as search engines, as the main tool to obtain information. To satisfy users' information needs, such systems usually provide a list of items that are selected out of a large candidate set and then sorted in decreasing order of their usefulness. The result lists are generated by a ranking algorithm, called a ranker, which takes the user's request and the candidate items as input and decides the order of the candidate items. The quality of these systems depends on the underlying rankers.

There are two main approaches to optimizing the ranker in an interactive system: using data annotated by humans or using interactive user feedback. The first approach, also called offline learning to rank, has been widely studied and is the industry standard. However, the annotated data may not represent the information needs of users well and are not timely. Thus, the first approach may lead to suboptimal rankers. The second approach optimizes rankers by using interactive feedback. This thesis considers the second approach, learning from interactive feedback. The reasons are two-fold:

  1. Every day, millions of users interact with interactive systems and generate a huge number of interactions, from which we can extract the information needs of users.
  2. Learning from interactive data has more potential to assist in designing online algorithms.

Specifically, this thesis considers the task of learning from the user click feedback. The main contribution of this thesis is proposing a safe online learning to re-rank algorithm, named BubbleRank, which addresses one main disadvantage of online learning, i.e., the safety issue, by combining the advantages of both offline and online learning to rank algorithms. The thesis also proposes three other online algorithms, each of which solves unique online ranker optimization problems. All the proposed algorithms are theoretically sound and empirically effective.

[download pdf]


Image by @mdr@twitter.com.

Abhishta defends PhD thesis on the impacts of DDoS attacks

The Blind Man and the Elephant: Measuring Economic Impacts of DDoS Attacks

by Abhishta

The Internet has become an important part of our everyday life. We use services like Netflix, Skype, online banking and Scopus daily. We even use the Internet for filing our taxes and communicating with the municipality. This dependency on network-based technologies also provides an opportunity for malicious actors in our society to remotely attack IT infrastructure. One such cyberattack that may lead to unavailability of network resources is known as a distributed denial of service (DDoS) attack. A DDoS attack leverages many computers to launch a coordinated Denial of Service attack against one or more targets.
These attacks cause damage to victim businesses. According to reports published by several consultancies and security companies, these attacks lead to millions of dollars in losses every year. One might ponder: are the damages caused by temporary unavailability of network services really this large? One of the points of criticism for these reports has been that they often base their findings on victim surveys and expert opinions. Now, as cost accounting/bookkeeping methods are not focused on measuring the impact of cyber security incidents, it is highly likely that surveys are unable to capture the true impact of an attack. A concerning fact is that most C-level managers make budgetary decisions for security based on the losses reported in these surveys. Several inputs for security investment decision models such as return on security investment (ROSI) also depend on these figures. This makes the situation very similar to the parable of the blind men and the elephant, in which the men try to conceptualise what the elephant looks like by touching it. Hence, it is important to develop methodologies that capture the true impact of DDoS attacks. In this thesis, we study the economic impact of DDoS attacks on public/private organisations by using an empirical approach.

[download thesis]

Flávio Martins defends PhD thesis on Temporal Models for Microblog Search

Temporal Information Models for Real-Time Microblog Search

by Flávio Martins

Real-time search in Twitter and other social media services is often biased towards the most recent results due to the “in the moment” nature of topic trends and their ephemeral relevance to users and media in general. However, “in the moment”, it is often difficult to look at all emerging topics and single out the important ones from the rest of the social media chatter. This thesis proposes to leverage external sources to estimate the duration and burstiness of live Twitter topics. It extends preliminary research where it was shown that temporal re-ranking using external sources could indeed improve the accuracy of results. To further explore this topic we pursued three significant novel approaches:
(1) multi-source information analysis that explores behavioral dynamics of users, such as Wikipedia live edits and page view streams, to detect topic trends and estimate the topic interest over time;
(2) efficient methods for federated query expansion towards the improvement of query meaning; and
(3) exploiting multiple sources towards the detection of temporal query intent.
This work differs from past approaches in that it operates over real-time queries, leveraging live user-generated content. This contrasts with previous methods that require an offline preprocessing step.

(Photo by @krisztianbalog@twitter.com)

Cum laude degree for Masrour Zoghi

Dueling bandits for online ranker evaluation

by Masrour Zoghi

In every domain where a service or a product is provided, an important question is that of evaluation: given a set of possible choices for deployment, what is the best one? An important example, which is considered in this work, is that of ranker evaluation from the field of information retrieval (IR). The goal of IR is to satisfy the information need of a user in response to a query issued by them, where this information need is typically satisfied by a document (or a small set of documents) contained in what is often a much larger collection. This goal is often attained by ranking the documents according to their usefulness to the issued query using an algorithm, called a ranker, a procedure that takes as input a query and a set of documents and specifies how the documents need to be ordered.
This thesis is concerned with ranker evaluation. The goal of ranker evaluation is to determine the quality of the rankers under consideration to allow us to choose the best option: given a finite set of possible rankers, which one of them leads to the highest level of user satisfaction? There are two main methods for carrying this out: absolute metrics and relative comparisons. This thesis is concerned with the second, relative form of ranker evaluation because it is more efficient at distinguishing between rankers of different quality: for instance, interleaved comparisons take a fraction of the time required by A/B testing, but they produce the same outcome. More precisely, the problem of online ranker evaluation from relative feedback can be described as follows: given a finite set of rankers, choose the best using only pairwise comparisons between the rankers under consideration, while minimizing the number of comparisons involving sub-optimal rankers. This problem is an instance of what is referred to as the dueling bandit problem in the literature.
The main contribution of this thesis is devising a dueling bandit algorithm, called Copeland Confidence Bounds (CCB), that solves this problem under practically general assumptions and providing theoretical guarantees for its proper functioning. In addition to that, the thesis contains a number of other algorithms that are better suited for dueling bandit problems with particular properties.
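As an illustration of what "the best ranker" can mean under pairwise comparisons, the sketch below computes Copeland scores: the number of other rankers that a given ranker beats with probability above one half. This is not the CCB algorithm itself, which has to estimate these probabilities from noisy comparisons while limiting comparisons between sub-optimal rankers; the function names and the preference matrices are hypothetical, and the probabilities are assumed to be known.

```python
def copeland_scores(pref):
    """Return the Copeland score of every ranker.

    pref[i][j] is the (assumed known) probability that ranker i beats
    ranker j in a pairwise comparison; a ranker's score is the number of
    other rankers it beats with probability greater than 0.5.
    """
    n = len(pref)
    return [
        sum(1 for j in range(n) if j != i and pref[i][j] > 0.5)
        for i in range(n)
    ]


def copeland_winners(pref):
    """Return the indices of all rankers with the maximal Copeland score."""
    scores = copeland_scores(pref)
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]


# Hypothetical preferences: ranker 0 beats both others, ranker 1 beats ranker 2.
transitive = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.55],
    [0.3, 0.45, 0.5],
]

# Hypothetical cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0.
cyclic = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]

print(copeland_winners(transitive))
print(copeland_winners(cyclic))
```

The cyclic case shows why a Copeland-style definition is convenient: no single ranker beats all the others there, yet the set of maximal-score rankers is still well defined.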

[download pdf]

Mohammad Khelghati defends PhD thesis on Deep Web Entity Monitoring

by Mohammadreza Khelghati

Data is one of the keys to success. Whether you are a fraud detection officer in a tax office, a data journalist or a business analyst, your primary concern is to access all the data relevant to your topics of interest. In such an information-thirsty environment, accessing every source of information is valuable. This emphasizes the role of the web as one of the biggest and main sources of data. In accessing web data through either general search engines or direct querying of deep web sources, the laborious work of querying, navigating results, downloading, storing and tracking data changes is a burden on the shoulders of users. To reduce this labor-intensive work of accessing data, (semi-)automatic harvesters have a crucial role. However, they lack a number of functionalities that we discuss and address in this work.
In this thesis, we investigate the path towards a focused web harvesting approach which can automatically and efficiently query websites, navigate through results, download data, store it and track data changes over time. Such an approach can also help users access a complete collection of data relevant to their topics of interest and monitor it over time. To realize such a harvester, we focus on the following obstacles. First, we try to find methods that can achieve the best coverage in harvesting data for a topic. Although using a fully automatic general harvester facilitates accessing web data, it is not a complete solution for achieving thorough data coverage on a given topic. Some search engines, in both the surface web and the deep web, restrict the number of requests from a user or limit the number of returned results presented to them. We suggest an efficient approach which can bypass these limitations and achieve complete data coverage.
Second, we investigate reducing the cost of harvesting a website, in terms of the number of submitted requests, by estimating its actual size. Harvesting tasks continue until they hit the query submission limits imposed by search engines or consume all the allocated resources. To prevent this undesirable situation, we need to know the size of the targeted source. For a website that hides the true size of its residing data, we suggest an accurate method to estimate its size.
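The abstract does not name the size estimator used in the thesis. As a rough illustration of how the size of a hidden collection can be estimated from samples alone, the sketch below uses the classic Lincoln-Petersen capture-recapture estimator, which is an assumption swapped in here for illustration: two independent samples of document identifiers are drawn (for example via two sets of queries), and the size is estimated from how much they overlap. The function name and the sample data are hypothetical.

```python
def lincoln_petersen(sample_a, sample_b):
    """Estimate a collection's size from two samples of document IDs.

    Classic capture-recapture estimate: |A| * |B| / |A intersect B|.
    Assumes both samples are drawn independently and roughly uniformly,
    which real search interfaces rarely guarantee.
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples do not overlap; size cannot be estimated")
    return len(a) * len(b) / overlap


# Hypothetical samples: 50 IDs each, 25 of which appear in both.
first_sample = range(0, 50)
second_sample = range(25, 75)
print(lincoln_petersen(first_sample, second_sample))
```

Query-based samples are typically biased towards highly ranked or frequently matched documents, so an estimate like this one should be read as a starting point rather than an accurate figure.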
As the third challenge, we focus on monitoring data changes over time in web data repositories. This information is helpful in providing the most up-to-date answers to the information needs of users. The fast-evolving web adds extra challenges to keeping a data collection up to date. Considering the costly process of harvesting, it is important to find methods which facilitate efficient re-harvesting processes.
Lastly, we combine our experiences in harvesting with the studies in the literature to suggest a general framework for designing and developing a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains and settings.
These steps bring further improvements to data coverage and monitoring functionalities of web harvesters and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring extreme software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.

[download pdf]