Welcome to the course Foundations of Information Retrieval, a new 5 credit course that is based on the first part of last year’s 10 credit course Information Retrieval. We will introduce some exciting new things in the course: This year’s practical assignments are motivated by use cases of the Text Retrieval Conference (TREC) Genomics track. We will use Elasticsearch, one of today’s most used and most popular open source scalable search systems. The practical assignments use Jupyter notebooks. We hope to see you at the first lecture on Wednesday 5 September at 10:45h.
by Jordy Michorius.
This research proposes an approach for reducing bias while maintaining the usability of the classifier. The approach requires all preprocessing to be done first, including one-hot encoding and splitting the data into a training and a test set. It then requires a banned feature: a feature that has, for example, been deemed morally irrelevant for the classification purpose. For the bias reduction, the proposal is to use the KS-score obtained from the two-sample KS-test to determine how well a feature contributes to classification and how well it contributes to the bias of the banned feature. For every feature X in the dataset that is not the label (L) or the banned feature (B), the following must hold for X to be safe to use in the training dataset:
KS-score(X|L=1, X|L=0) > KS-score(X|B=1, X|B=0)
After all features are checked, the unsafe (or flagged) features need to be removed from both the training and the test set in order to make the classifier as fair as possible. The datasets used are the Titanic dataset, with passenger class as the banned feature, and a Financial survey, with race as the banned feature. The results show that the overall bias was reduced for both the Titanic dataset and the Financial survey. However, in terms of relative fairness, the Financial survey became less fair for one banned feature value (Race = White). All other values became fairer for both the Financial survey and the Titanic dataset.
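The safety check described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: it assumes that the label and the (one-hot encoded) banned feature are binary 0/1 columns, it computes the two-sample KS statistic directly from the empirical distribution functions instead of calling a statistics library, and the function names and toy data are invented for the example.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def split_by(values, flags):
    """Split values into (values where flag == 1, values where flag == 0)."""
    ones = [v for v, f in zip(values, flags) if f == 1]
    zeros = [v for v, f in zip(values, flags) if f == 0]
    return ones, zeros


def flag_unsafe_features(features, label, banned):
    """Return the features X for which KS(X|L=1, X|L=0) > KS(X|B=1, X|B=0) fails."""
    unsafe = []
    for name, values in features.items():
        ks_label = ks_statistic(*split_by(values, label))
        ks_banned = ks_statistic(*split_by(values, banned))
        if not ks_label > ks_banned:
            unsafe.append(name)
    return unsafe


# Toy data, invented for illustration only.
label = [1, 1, 1, 0, 0, 0]
banned = [1, 0, 1, 0, 1, 0]
features = {
    "follows_label": [9, 8, 7, 1, 2, 3],
    "follows_banned": [9, 1, 8, 2, 7, 3],
}
unsafe = flag_unsafe_features(features, label, banned)
print(unsafe)
```

In this toy example the second feature separates the banned groups better than it separates the labels, so it is the one that gets flagged. The flagged columns would then be dropped from both the training and the test set.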
The role of Online Identity on Donations to Nonprofit Organizations in Online Health Campaigns
by Anna Priante, Ariana Need, Tijs van den Broek, and Djoerd Hiemstra
Nonprofit Organizations largely use social media to mobilize people for social causes and encourage participation in collective action, such as advocacy campaigns. However, little is known about the micro-level mechanisms that drive individual mobilization outcomes that require a substantial effort in participation, such as collecting donations during advocacy campaigns. By answering the call to combine motivational and structural factors that explain the mechanisms driving people’s engagement in collective action via social media, we focus on the role of online social identity as a motivator to engage in campaigns, and on individual network positions as opportunity structures for online mobilization. Using the 2014 US Movember health movement campaign on Twitter as an empirical context, we adopt a multi-method approach combining Natural Language Processing, social network analysis and multivariate regression analysis to investigate the effects of online social identity and structural network position on the amount of donations collected for medical research during the campaign. We find that only social identities related to occupations and professions have significant effects on the amount of collected donations, whereas network position matters when movement members are central in the communication process because they connect different cohesive subgroups, or communities of the network, characterized by the prevalence of weak ties. We show the importance of integrating the study of identity and network to advance our understanding of online micro-mobilization dynamics. This study offers contributions to research at the intersection of the non-profit sector, social movements, media and communication, and health fundraising.
To be presented at the 78th Annual Meeting of the Academy of Management on 14 August 2018 in Chicago, USA
How Online Identity influences Collected Donations in Online Health Campaigns
by Anna Priante, Michel Ehrenhard, Tijs van den Broek, Ariana Need, and Djoerd Hiemstra
Health advocacy organizations increasingly use social media to engage people in fundraising campaigns for medical research, such as cancer prevention. However, little is known about the effectiveness of online health campaigns and the psychosocial mechanisms that drive people’s voluntary engagement to collect money for medical research. By using identity-based motivation theory from social psychology, we focus on campaign participants’ online occupational identity, such as being a doctor, and how it provides motivation to collect donations. We investigate the mechanisms, such as fundraisers’ Twitter activity as a cognitive process and their central network positions in online communication, that mediate the relationship between identity and donations.
We adopt a multi-method approach combining automatic text analysis, Natural Language Processing from computational linguistics, social network analysis and multivariate regression analysis. Using the 2014 US Movember health movement campaign on Twitter as an empirical context, we find that when people are engaged in health fundraising on Twitter, their success depends on the extent to which they act in occupational identity-congruent ways. In addition, we find that fundraisers’ Twitter activity as a sense-making, cognitive process – and not their central positions in online communication – mediates the relation between identity and donations.
We show the importance of integrating both people’s social identification and cognitive processes into theory and research for a better understanding of how occupational identity matters in online health campaigns. This study offers contributions to research at the intersection of health advocacy, social media use, and, more broadly, online social movements. We conclude by discussing the practical implications of these findings for health advocacy organizations.
To be presented at the 113th Annual Meeting of the American Sociological Association (ASA 2018) on 11-14 August 2018 in Philadelphia, USA.
Welcome to Research Experiments in Databases and Information Retrieval (REDI)! The theme of this year’s course is: Recommendation in federated social networks. Federated social networks consist of multiple independent servers that cooperate. An example is Mastodon, a free open source implementation of a micro-blogging social network that resembles Twitter. Unlike Twitter (or Facebook for that matter), nobody has a complete view of all accounts and posts in a federated social network. We will address two research problems: 1) How to implement recommendations using only local knowledge of the network? and 2) How to evaluate your system in such a highly dynamic environment?
We are the first University of Twente course with a public Canvas syllabus. Of course, we will appropriately use Mastodon to communicate about REDI. Please make an account on mastodon.utwente.nl and follow the hash tag #REDI. Use the hash tag in questions and toots about the course.
The Data Management and Biometrics group and the Formal Methods & Tools group at the University of Twente seek a PhD candidate for SEQUOIA: Smart maintenance optimization via big data & fault tree analysis, a project funded by NWO Applied and Engineering Sciences and the companies ProRail and NS. ProRail is responsible for the Dutch railway network, including its construction, management, maintenance, and safety; NS has the same responsibility for the Dutch train fleet. The project is led by Mariëlle Stoelinga, Joost-Pieter Katoen and Djoerd Hiemstra.
SEQUOIA aims to improve the reliability of the Dutch railroads by deploying big data analytics to predict and prevent failures. Its scientific core is a novel combination of machine learning, fault tree analysis and stochastic model checking. The key idea is that big data analytics provide the statistics on failures, their correlations, dependencies etc., and fault trees provide the domain knowledge needed to interpret these data. The project aims to develop explainable machine learning techniques that discover causal relations instead of statistical correlations, as well as machine learning of fault trees or of other models that are normally designed top-down by domain experts. The techniques should help ProRail to decrease train disruptions and delays, to lower maintenance cost, and to increase passenger comfort.
The project involves an intense cooperation with ProRail and RWTH Aachen University. The PhD candidate will spend a portion of their time at ProRail. Key project deliverables are efficient analysis algorithms and a workable tool to be used in the ProRail context. For more information, see:
by Johannes Wassenaar
Linking segments of video using text-based methods and a flexible form of segmentation
To let users explore and use large archives, video hyperlinking helps the user link segments of video to other segments of video, similar to the way hyperlinks are used on the web, instead of relying on a regular search tool. Indexing, querying and re-ranking multimodal data, in this case videos, are common subjects in the video hyperlinking community. A video hyperlinking system contains an index of multimodal (video) data, and the currently watched segment, the anchor, is translated into a query in the query generation phase. Finally, the system responds to the user with a ranked list of targets that are about the anchor segment. In this study, the payload of terms in Elasticsearch, in the form of position and offset, is used to obtain time-based information along the speech transcripts, so that users can be linked directly to spoken text. The queries are generated by a statistics-based method using TF-IDF, by a grammar-based part-of-speech tagger, or by a combination of both. Finally, results are ranked by weighting specific components and by cosine similarity. The system is evaluated with the Precision at 5 and MAiSP measures, which are used in the TRECVid benchmark on this topic. The results show that TF-IDF and cosine similarity work best for the proposed system.
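The combination of TF-IDF query generation and cosine-similarity ranking can be illustrated with a small, self-contained Python sketch. It is an illustration under stated assumptions rather than the system from the thesis: it works on in-memory token lists instead of an Elasticsearch index, it uses one common smoothed TF-IDF variant, it ignores the part-of-speech tagger and the component weighting, and all function names and transcripts are made up for the example.

```python
import math
from collections import Counter


def tf_idf(tokens, doc_freq, n_docs):
    """Weight each term by its frequency times a smoothed inverse document frequency."""
    counts = Counter(tokens)
    return {t: c * math.log((1 + n_docs) / (1 + doc_freq.get(t, 0)))
            for t, c in counts.items()}


def generate_query(anchor_tokens, doc_freq, n_docs, k=3):
    """Keep the k highest-weighted anchor terms as the query (ties broken alphabetically)."""
    weights = tf_idf(anchor_tokens, doc_freq, n_docs)
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy speech transcripts of candidate target segments (invented).
segments = {
    "s1": "the railway bridge was rebuilt after the flood".split(),
    "s2": "the chef cooks pasta in the kitchen".split(),
    "s3": "the flood damaged the railway station".split(),
}
anchor = "the flood closed the railway".split()

n_docs = len(segments)
doc_freq = Counter(t for tokens in segments.values() for t in set(tokens))

query = generate_query(anchor, doc_freq, n_docs)
query_vec = tf_idf(query, doc_freq, n_docs)
scores = {s: cosine(query_vec, tf_idf(tokens, doc_freq, n_docs))
          for s, tokens in segments.items()}
ranking = sorted(segments, key=lambda s: (-scores[s], s))
print(query, ranking)
```

The word "the" occurs in every segment and therefore receives a weight of zero, so it never makes it into the query. The unrelated cooking segment shares no weighted term with the query and ends up at the bottom of the ranking.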
The University of Twente is the first Dutch university to run its own Mastodon server. Mastodon is a social network based on open web protocols and free, open-source software. It is decentralized like e-mail. Learning from failures of other networks, Mastodon aims to make ethical design choices to combat the misuse of social media. By joining U. Twente Mastodon, you join a global social network with more than a million people. The university will not sell your data, nor show you advertisements. Mastodon U. Twente is available to all students, alumni, and employees.
Automatic Product Name Recognition from Short Product Descriptions
by Elnaz Pazhouhi
This thesis studies the problem of product name recognition from short product descriptions. This is an important problem, especially with the increasing use of ERP (Enterprise Resource Planning) software at the core of modern business management systems, where the information of business transactions is stored in unstructured data stores. A solution to the problem of product name recognition is especially useful for intermediate businesses, as they are interested in finding potential matches between the items in product catalogs (produced by manufacturers or another intermediate business) and items in the product requests (given by the end user or another intermediate business).
In this context the problem of product name recognition is especially challenging because product descriptions are typically short, ungrammatical, incomplete, abbreviated and multilingual. In this thesis we investigate the application of supervised machine-learning techniques and gazetteer-based techniques to our problem. To approach the problem, we define it as a classification problem where the tokens of product descriptions are classified into I, O and B classes according to the standard IOB tagging scheme. Next we investigate and compare the performance of a set of hybrid solutions that combine machine learning and gazetteer-based approaches. We study a solution space that uses four learning models: linear and non-linear SVC, Random Forest, and AdaBoost. For each solution, we use the same set of features. We divide the features into four categories: token-level features, document-level features, gazetteer-based features and frequency-based features. Moreover, we use automatic feature selection to reduce the dimensionality of the data, which improves training efficiency and helps to avoid over-fitting.
To be able to evaluate the solutions, we develop a machine learning framework that takes as its inputs a list of predefined solutions (i.e. our solution space) and a preprocessed labeled dataset (i.e. a feature vector X, and a corresponding class label vector Y). It automatically selects the optimal number of most relevant features, optimizes the hyper-parameters of the learning models, trains the learning models, and evaluates the solution set. We believe that our automated machine learning framework can effectively be used as an AutoML framework that automates most of the decisions that have to be made in the design process of a machine learning solution for a particular domain (e.g. for product name recognition).
Moreover, we conduct a set of experiments and, based on the results, answer the research questions of this thesis. In particular, we determine (1) which learning models are more effective for our task, (2) which feature groups contain the most relevant features, (3) what the contribution of the different feature groups is to the overall performance of the induced model, (4) how gazetteer-based features are incorporated into the machine learning solutions, (5) how effective gazetteer-based features are, (6) what the role of hyper-parameter optimization is and (7) which models are more sensitive to hyper-parameter optimization.
According to our results, the solutions with maximum and minimum performance are non-linear SVC with an F1 measure of 65% and AdaBoost with an F1 measure of 59%, respectively. This suggests that the choice of classifier has only a limited effect on the final outcome of the learning model, at least on the studied dataset. Additionally, our results show that the most effective feature group is the document-level features, with a 14.8% contribution to the overall performance (i.e. F1 measure). In second position is the group of token-level features, with a 6.8% contribution. The other two groups, the gazetteer-based features and frequency-based features, have small contributions of 1% and 0.5% respectively. However, further investigation relates the poor performance of the gazetteer-based features to the low coverage of the gazetteer used (i.e. ETIM).
Our experiments also show that all learning models over-fit the training data when a large number of features is used; thus the use of feature selection techniques is essential to the robustness of the proposed solutions. Among the studied learning models, the performance of the non-linear SVC and AdaBoost models strongly depends on the hyper-parameters used. Therefore, for those models the computational cost of hyper-parameter tuning is justifiable.
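To make the IOB formulation concrete, the following Python sketch shows how a product description can be labeled token by token and how product names can be read back from the tags. It is only an illustration: the helper names and the example description are hypothetical, and the labels here come from exact matching against a known product name, whereas the thesis predicts them with learned classifiers.

```python
def iob_tags(tokens, product_name_tokens):
    """Label each token B (begins a product name), I (inside one) or O (outside)."""
    tags = ["O"] * len(tokens)
    n = len(product_name_tokens)
    if n == 0:
        return tags
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in product_name_tokens]
    i = 0
    while i <= len(tokens) - n:
        if lowered[i:i + n] == target:
            tags[i] = "B"
            for j in range(i + 1, i + n):
                tags[j] = "I"
            i += n  # continue after the matched name
        else:
            i += 1
    return tags


def extract_names(tokens, tags):
    """Recover product names from a tag sequence (predicted or gold)."""
    names, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                names.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hypothetical short, ungrammatical product description.
tokens = "2x led lamp e27 warm white 5w".split()
tags = iob_tags(tokens, ["led", "lamp"])
names = extract_names(tokens, tags)
print(list(zip(tokens, tags)), names)
```

A classifier in this setting receives one feature vector per token and predicts one of the three tags; `extract_names` then turns the predicted tag sequence back into product name strings.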