Flávio Martins defends PhD thesis on Temporal Models for Microblog Search

Temporal Information Models for Real-Time Microblog Search

by Flávio Martins

Real-time search in Twitter and other social media services is often biased towards the most recent results due to the “in the moment” nature of topic trends and their ephemeral relevance to users and media in general. However, “in the moment”, it is often difficult to look at all emerging topics and single out the important ones from the rest of the social media chatter. This thesis proposes to leverage external sources to estimate the duration and burstiness of live Twitter topics. It extends preliminary research which showed that temporal re-ranking using external sources could indeed improve the accuracy of results. To further explore this topic, we pursued three significant novel approaches:
(1) multi-source information analysis that explores behavioral dynamics of users, such as Wikipedia live edits and page view streams, to detect topic trends and estimate the topic interest over time;
(2) efficient methods for federated query expansion towards the improvement of query meaning; and
(3) exploiting multiple sources towards the detection of temporal query intent.
It differs from past approaches in that it works over real-time queries, leveraging live user-generated content. This contrasts with previous methods that require an offline preprocessing step.

(Photo by @krisztianbalog@twitter.com)

Whom to Follow on Mastodon?

Recommending Users: Whom to Follow on Federated Social Networks

by Jan Trienes, Andrés Torres Cano, and Djoerd Hiemstra

To foster an active and engaged community, social networks employ recommendation algorithms that filter large amounts of content and provide a user with personalized views of the network. Popular social networks such as Facebook and Twitter generate follow recommendations by listing profiles a user may be interested in connecting with. Federated social networks aim to resolve issues associated with the popular social networks – such as large-scale user surveillance and the misuse of user data to manipulate elections – by decentralizing authority and promoting privacy. Due to their recent emergence, recommender systems do not yet exist for federated social networks. To make these networks more attractive and promote community building, we investigate how recommendation algorithms can be applied to decentralized social networks. We present an offline and online evaluation of two recommendation strategies: a collaborative filtering recommender based on BM25 and a topology-based recommender using personalized PageRank. Our experiments on a large unbiased sample of the federated social network Mastodon show that collaborative filtering approaches outperform a topology-based approach, whereas both approaches significantly outperform a random recommender. A subsequent live user experiment on Mastodon using balanced interleaving shows that the collaborative filtering recommender performs on par with the topology-based recommender.
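As an illustration of the topology-based strategy mentioned in the abstract, here is a minimal sketch of personalized PageRank over a follow graph, computed by power iteration. It is not the authors' implementation: the function names, the damping factor of 0.85, the handling of users who follow nobody, and the toy graph are all assumptions made for this example.

```python
def personalized_pagerank(follows, source, damping=0.85, iterations=50):
    """Approximate personalized PageRank by power iteration.

    `follows` maps each user to the list of users they follow (hypothetical
    input format). All teleportation goes back to `source`.
    """
    nodes = set(follows)
    for targets in follows.values():
        nodes.update(targets)

    # Start with all probability mass on the source user.
    scores = {node: 0.0 for node in nodes}
    scores[source] = 1.0

    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        for node, score in scores.items():
            targets = follows.get(node, [])
            if targets:
                # Spread the damped mass evenly over followed users.
                share = damping * score / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # Assumption: users who follow nobody return their mass to the source.
                updated[source] += damping * score
        # Teleport the remaining mass (total mass is 1) back to the source.
        updated[source] += 1.0 - damping
        scores = updated
    return scores


def recommend(follows, source, k=3):
    """Rank users that `source` does not follow yet by their PageRank score."""
    scores = personalized_pagerank(follows, source)
    known = set(follows.get(source, [])) | {source}
    candidates = [(score, user) for user, score in scores.items() if user not in known]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [user for _, user in candidates[:k]]


# Toy follow graph with made-up user names.
example_follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["alice"],
    "erin": [],
}
recommendations = recommend(example_follows, "alice", k=2)
```

In this toy graph, "dave" is followed by both accounts that "alice" follows, so he ranks above "erin", who is followed by only one of them.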

This paper will be presented at the 17th Dutch-Belgian Information Retrieval workshop in Leiden on 23 November 2018.

[download pdf]

Participate in the Dutch-Belgian Information Retrieval Workshop

The 17th Dutch-Belgian Information Retrieval workshop (DIR 2018) takes place in Leiden on 23 November 2018. DIR has a diverse 1-day programme with 2 keynotes, 5 talks, 7 posters and 4 demos!

The Dutch-Belgian Information Retrieval workshop (DIR) aims to serve as an international platform (with a special focus on the Netherlands and Belgium) for exchange and discussions on research & applications in the field of information retrieval and related fields.

More information at: http://dir2018.nl.

Candy Reebroek graduates on engagement behavior in online brand communities

Understanding engagement behavior in online brand communities: how social identity relates to frequency of interaction and tweet sentiment.

by Candy Reebroek

This study explains engagement behavior in online brand communities based on data from Twitter users who present different types of social identities. For this, we examined fifteen online brand communities that are popular on Twitter and originate from the fashion, fast-food, gaming, cars, and sports sectors. In total, 27,143 Twitter messages from 22,333 unique Twitter users were analyzed. We used the Twitter users’ profile descriptions to classify their social identity with the help of computational methods such as Machine Learning and Natural Language Processing. To study the engagement behavior of the Twitter users, we calculated the tweet sentiment and the frequency of interaction between Twitter users and online brand communities. We found that tweet sentiment and frequency of interaction vary significantly between different social identity groups when mentioning different online brand communities. This result is important for online brand community managers to understand what kind of Twitter users interact with their online brand community and how these users engage with the community. Right now, they might only investigate demographics about the users but do not consider the users’ self-presentation online. Furthermore, we made a theoretical contribution by including a larger dataset, by applying computational methods and by exploring multiple online brand communities from different sectors.

[download pdf]

Semere Bitew graduates Cum Laude on Logical Structure Extraction of Electronic Documents

Logical Structure Extraction of Electronic Documents Using Contextual Information

by Semere Bitew

Logical document structure extraction refers to the process of coupling semantic meanings (logical labels) such as title, authors, affiliation, etc., to physical sections in a document. For example, in scientific papers the first paragraph is usually a title. Logical document structure extraction is a challenging natural language processing problem. Elsevier, as one of the biggest scientific publishers in the world, is working on recovering logical structure from article submissions in its project called the Apollo project. The current process in this project requires the involvement of human annotators to make sure logical entities in articles are labelled with correct tags, such as title, abstract, heading, reference-item and so on. If automated, this process can be more efficient in producing correct tags and in providing high-quality, consistent, publishable articles. A lot of research has been done to automatically extract the logical structure of documents. In this thesis, a document is defined as a sequence of paragraphs, and recovering the labels for each paragraph yields the logical structure of a document. For this purpose, we proposed a novel approach that combines random forests with conditional random fields (RF-CRFs) and long short-term memory with CRFs (LSTM-CRFs). Two variants of CRFs called linear-chain CRFs (LCRFs) and dynamic CRFs (DCRFs) are used in both of the proposed approaches. These approaches consider the label information of surrounding paragraphs when classifying paragraphs. Three categories of features, namely textual, linguistic and markup features, are extracted to build the RF-CRF models. A word embedding is used as an input to build the LSTM-CRF models. Our models were evaluated for extracting reference-items on Elsevier’s Apollo dataset of 146,333 paragraphs. Our results show that LSTM-CRF models trained on the dataset outperform the RF-CRF models and existing approaches.
We show that the LSTM component efficiently uses past feature inputs within a paragraph. The CRF component is able to exploit the contextual information using the tag information of surrounding paragraphs. It was observed that the feature categories are complementary. They produce the best performance when all the features are used. On the other hand, this manual feature extraction can be replaced with an LSTM, where no handcrafted features are used, achieving a better performance. Additionally, the inclusion of features generated for the previous and next paragraph as part of the feature vector for classifying the current paragraph improved the performance of all the models.
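The last observation above, that adding the features of the previous and next paragraph to the feature vector of the current paragraph helps, can be sketched as follows. This is only an illustration under assumptions: the thesis does not specify the ordering of the concatenated parts or how the first and last paragraph are handled, so the zero padding and the name `add_context` are hypothetical.

```python
def add_context(features, pad=0.0):
    """Concatenate each paragraph's features with those of its neighbours.

    `features` holds one equally sized feature vector (list of floats) per
    paragraph, in document order. The first and last paragraph have no
    neighbour on one side, so that side is filled with `pad` values
    (an assumption made for this sketch).
    """
    if not features:
        return []
    width = len(features[0])
    padding = [pad] * width

    extended = []
    for index, current in enumerate(features):
        previous = features[index - 1] if index > 0 else padding
        following = features[index + 1] if index + 1 < len(features) else padding
        # Resulting layout: [previous paragraph | current paragraph | next paragraph]
        extended.append(previous + current + following)
    return extended


# Three paragraphs with two made-up features each.
paragraph_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
with_context = add_context(paragraph_features)
```

Each resulting vector is three times as wide as the original one, and any classifier that accepts fixed-length vectors can consume it unchanged.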

[download pdf]

Welcome to Foundations of Information Retrieval

Welcome to the course Foundations of Information Retrieval, a new 5-credit course that is based on the first part of last year’s 10-credit course Information Retrieval. We will introduce some exciting new things in the course: This year’s practical assignments are motivated by use cases of the Text Retrieval Conference’s Genomics track. We will use Elasticsearch, one of today’s most widely used open source scalable search systems. The practical assignments use Jupyter notebooks. We hope to see you at the first lecture on Wednesday 5 September at 10:45h.

Check out the Canvas syllabus

Jordy Michorius graduates on Fair Machine Learning

by Jordy Michorius

This research proposes an approach for reducing bias while still maintaining the usability of the classifier. The approach requires all preprocessing to be done first, including one-hot encoding and the split into training and test sets. It then requires a banned feature: a feature that has, for example, been deemed morally irrelevant for the classification purpose. For the bias reduction, the proposal is to use the KS-score obtained from the two-sample KS-test to determine how well a feature contributes to classification and how well it contributes to the bias of the banned feature. This means that, for every feature X in the dataset that is not the label (L) or the banned feature (B), the following must hold for X to be safe to use in the training dataset:

KS-score(X|L=1, X|L=0) > KS-score(X|B=1, X|B=0)

After all features are checked, the unsafe (or flagged) features need to be removed from both the training and the test set in order to make the classifier as fair as possible. The datasets that have been used are the Titanic dataset, with passenger class as the banned feature, and a Financial survey, with race as the banned feature. The results show that the overall bias has been reduced for both the Titanic dataset and the Financial survey. However, in terms of relative fairness, the Financial survey is the only one that became less fair for a certain banned feature value (Race = White). All other values became fairer for both the Financial survey and the Titanic dataset.
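The feature check described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the thesis: it computes the two-sample KS statistic directly instead of calling a statistics library, assumes that the label and the banned feature are both binary (0/1) columns, and uses made-up column names and data.

```python
from bisect import bisect_right


def ks_score(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for value in set(a) | set(b):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def flag_unsafe_features(rows, label, banned):
    """Return the features X for which KS(X|L) > KS(X|B) does NOT hold.

    `rows` is a list of dicts with numeric feature values; `label` and
    `banned` name binary columns (an assumption of this sketch).
    """
    def separation(feature, column):
        ones = [row[feature] for row in rows if row[column] == 1]
        zeros = [row[feature] for row in rows if row[column] == 0]
        return ks_score(ones, zeros)

    features = [name for name in rows[0] if name not in (label, banned)]
    return [
        name for name in features
        if not separation(name, label) > separation(name, banned)
    ]


# Made-up data: "x_good" follows the label L, "x_bad" follows the banned feature B.
rows = [
    {"x_good": 1.0, "x_bad": 0.0, "L": 1, "B": 0},
    {"x_good": 0.9, "x_bad": 1.0, "L": 1, "B": 1},
    {"x_good": 0.1, "x_bad": 0.1, "L": 0, "B": 0},
    {"x_good": 0.0, "x_bad": 0.9, "L": 0, "B": 1},
]
unsafe = flag_unsafe_features(rows, label="L", banned="B")
```

Here `x_bad` separates the two values of the banned feature better than it separates the label, so it is flagged and would be dropped from both the training and the test set.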

The role of Online Identity on Donations

The role of Online Identity on Donations to Nonprofit Organizations in Online Health Campaigns

by Anna Priante, Ariana Need, Tijs van den Broek, and Djoerd Hiemstra

Nonprofit Organizations largely use social media to mobilize people for social causes and encourage participation in collective action, such as advocacy campaigns. However, little is known about the micro-level mechanisms that drive individual mobilization outcomes that require a substantial effort in participation, such as collecting donations during advocacy campaigns. By answering the call to combine motivational and structural factors that explain the mechanisms driving people’s engagement in collective action via social media, we focus on the role of online social identity as a motivator to engage in campaigns, and on individual network positions as opportunity structures for online mobilization. Using the 2014 US Movember health movement campaign on Twitter as an empirical context, we adopt a multi-method approach combining Natural Language Processing, social network analysis and multivariate regression analysis to investigate the effects of online social identity and structural network position on the amount of collected donations for medical research during the campaign. We find that only social identities related to occupations and professions have significant effects on the amount of collected donations, whereas network position matters when movement members are central in the communication process because they connect different cohesive subgroups, or communities of the network, characterized by the prevalence of weak ties. We show the importance of integrating the study of identity and network to advance our understanding of online micro-mobilization dynamics. This study offers contributions to research at the intersection of research on the non-profit sector, social movements, media and communication, and health fundraising.

To be presented at the 78th Annual Meeting of the Academy of Management on 14 August 2018 in Chicago, USA

Tweeting about my moustache

How Online Identity influences Collected Donations in Online Health Campaigns

by Anna Priante, Michel Ehrenhard, Tijs van den Broek, Ariana Need, and Djoerd Hiemstra

Health advocacy organizations increasingly use social media to engage people in fundraising campaigns for medical research, such as cancer prevention. However, little is known about the effectiveness of online health campaigns and the psychosocial mechanisms that drive people’s voluntary engagement to collect money for medical research. By using identity-based motivation theory from social psychology, we focus on campaign participants’ online occupational identity, such as being a doctor, and how it provides motivation to collect donations. We investigate the mechanisms, such as fundraisers’ Twitter activity as a cognitive process and their central network positions in online communication, that mediate the relationship between identity and donations.

We adopt a multi-method approach combining automatic text analysis, Natural Language Processing from computational linguistics, social network analysis and multivariate regression analysis. Using the 2014 US Movember health movement campaign on Twitter as an empirical context, we find that when people are engaged in health fundraising on Twitter, their success depends on the extent to which they act in occupational identity-congruent ways. In addition, we find that fundraisers’ Twitter activity as a sense-making, cognitive process – and not their central positions in online communication – mediates the relation between identity and donations.

We show the importance of integrating both people’s social identification and cognitive processes into theory and research for a better understanding of how occupational identity matters in online health campaigns. This study offers contributions to research at the intersection of health advocacy, social media use, and, more broadly, online social movements. We conclude by discussing the practical implications of these findings for health advocacy organizations.

To be presented at the 113th Annual Meeting of the American Sociological Association (ASA 2018) on 11-14 August 2018 in Philadelphia, USA.