Welcome to Research Experiments in Databases and Information Retrieval (REDI)! The theme of this year’s course is: Recommendation in federated social networks. Federated social networks consist of multiple independent servers that cooperate. An example is Mastodon, a free open source implementation of a micro-blogging social network that resembles Twitter. Unlike Twitter (or Facebook for that matter), nobody has a complete view of all accounts and posts in a federated social network. We will address two research problems: 1) How to implement recommendations using only local knowledge of the network? and 2) How to evaluate your system in such a highly dynamic environment?
We are the first University of Twente course with a public Canvas syllabus. Fittingly, we will use Mastodon to communicate about REDI. Please make an account on mastodon.utwente.nl and follow the hashtag #REDI. Use the hashtag in questions and toots about the course.
The Data Management and Biometrics group and the Formal Methods & Tools group at the University of Twente seek a PhD candidate for SEQUOIA: Smart maintenance optimization via big data & fault tree analysis, a project funded by NWO Applied and Engineering Sciences and the companies ProRail and NS. ProRail is responsible for the Dutch railway network, including its construction, management, maintenance, and safety; NS has the same responsibility for the Dutch train fleet. The project is led by Mariëlle Stoelinga, Joost-Pieter Katoen and Djoerd Hiemstra.
SEQUOIA aims to improve the reliability of the Dutch railroads by deploying big data analytics to predict and prevent failures. Its scientific core is a novel combination of machine learning, fault tree analysis and stochastic model checking. The key idea is that big data analytics provide the statistics on failures, their correlations, dependencies, etc., while fault trees provide the domain knowledge needed to interpret these data. The project aims to develop explainable machine learning techniques that discover causal relations instead of statistical correlations, as well as machine learning of fault trees or of other models that are normally designed top-down by domain experts. The techniques should help ProRail to decrease train disruptions and delays, to lower maintenance cost, and to increase passenger comfort.
The project involves an intense cooperation with ProRail and the RWTH Aachen University. The PhD candidate will spend a portion of their time at ProRail. Key project deliverables are efficient analysis algorithms and a workable tool to be used in the ProRail context. For more information, see:
by Johannes Wassenaar
Linking segments of video using text-based methods and a flexible form of segmentation
To let users explore and use large archives, video hyperlinking aims to link segments of video to other video segments, similar to the way hyperlinks are used on the web, instead of relying on a regular search tool. Indexing, querying and re-ranking multimodal data, in this case videos, are common subjects in the video hyperlinking community. A video hyperlinking system contains an index of multimodal (video) data, and the currently watched segment is translated into a query in the query generation phase. Finally, the system responds to the user with a ranked list of targets that are about the anchor segment. In this study, the payloads of terms in Elasticsearch, in the form of position and offset, are used to obtain time-based information along the speech transcripts so that users can be linked directly to spoken text. The queries are generated by a statistics-based method using TF-IDF, a grammar-based part-of-speech tagger, or a combination of both. Finally, results are ranked by weighting specific components and by cosine similarity. The system is evaluated with the Precision at 5 and MAiSP measures, which are used in the TRECVid benchmark on this topic. The results show that TF-IDF and cosine similarity work best for the proposed system.
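The TF-IDF query generation and cosine ranking described above can be illustrated with a minimal sketch. This is not the thesis implementation: the whitespace tokenizer, the smoothed IDF formula, the function names and the example segments are all assumptions made for illustration, and the real system works on an Elasticsearch index rather than in-memory strings.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real text analysis)."""
    return text.lower().split()


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are ignored."""
    tf = Counter(tokenize(text))
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def generate_query(anchor_text, idf, k=3):
    """Keep the k highest-weighted anchor terms (ties broken alphabetically)."""
    weights = tfidf_vector(anchor_text, idf)
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_targets(anchor_text, segments, k=3):
    """Return (score, segment index) pairs, best match first."""
    idf = idf_table(segments)
    query = " ".join(generate_query(anchor_text, idf, k))
    qvec = tfidf_vector(query, idf)
    scored = [(cosine(qvec, tfidf_vector(s, idf)), i) for i, s in enumerate(segments)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical transcript segments, invented for this example.
segments = [
    "the railway bridge collapsed in the storm",
    "cooking pasta with tomato sauce",
    "storm damage to the railway network",
]
ranking = rank_targets("the storm hit the railway", segments)
```

In this toy collection the cooking segment shares no query term with the anchor, so it receives a score of zero and ends up last. A part-of-speech based variant would replace `generate_query` with a filter that keeps, for example, only nouns.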
The University of Twente is the first Dutch university to run its own Mastodon server. Mastodon is a social network based on open web protocols and free, open-source software. It is decentralized like e-mail. Learning from failures of other networks, Mastodon aims to make ethical design choices to combat the misuse of social media. By joining U. Twente Mastodon, you join a global social network with more than a million people. The university will not sell your data, nor show you advertisements. Mastodon U. Twente is available to all students, alumni, and employees.
Join Mastodon U. Twente now
Automatic Product Name Recognition from Short Product Descriptions
by Elnaz Pazhouhi
This thesis studies the problem of product name recognition from short product descriptions. This is an important problem, especially with the increasing use of ERP (Enterprise Resource Planning) software at the core of modern business management systems, where the information of business transactions is stored in unstructured data stores. A solution to the problem of product name recognition is especially useful for intermediate businesses, as they are interested in finding potential matches between the items in product catalogs (produced by manufacturers or another intermediate business) and items in the product requests (given by the end user or another intermediate business).
In this context, the problem of product name recognition is especially challenging because product descriptions are typically short, ungrammatical, incomplete, abbreviated and multilingual. In this thesis we investigate the application of supervised machine-learning techniques and gazetteer-based techniques to our problem. To approach the problem, we define it as a classification problem where the tokens of product descriptions are classified into I, O and B classes according to the standard IOB tagging scheme. Next, we investigate and compare the performance of a set of hybrid solutions that combine machine learning and gazetteer-based approaches. We study a solution space that uses four learning models: linear and non-linear SVC, Random Forest, and AdaBoost. For each solution, we use the same set of features. We divide the features into four categories: token-level features, document-level features, gazetteer-based features and frequency-based features. Moreover, we use automatic feature selection to reduce the dimensionality of the data, which improves training efficiency and helps to avoid over-fitting.
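As a small illustration of the IOB scheme mentioned above, the sketch below labels the tokens of a product description as B (first token of a product name), I (inside a product name) or O (outside). The example description and the helper names are hypothetical and not taken from the thesis.

```python
def iob_tags(tokens, name_spans):
    """Label each token B, I or O.

    name_spans holds (start, end) token indices of product names, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in name_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def spans_from_tags(tags):
    """Recover (start, end) spans from a tag sequence.

    An I tag that does not follow a B or I tag is treated as the start of a name.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Hypothetical short product description; "cable tie" is the product name.
tokens = ["2x", "cable", "tie", "black", "200mm"]
tags = iob_tags(tokens, [(1, 3)])
```

A token classifier trained on such labels predicts one of the three classes per token, and `spans_from_tags` turns its predictions back into product name spans.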
To evaluate the solutions, we develop a machine learning framework that takes as its inputs a list of predefined solutions (i.e. our solution space) and a preprocessed labeled dataset (i.e. a feature vector X, and a corresponding class label vector Y). It automatically selects the optimal number of most relevant features, optimizes the hyper-parameters of the learning models, trains the learning models, and evaluates the solution set. We believe that our automated machine learning framework can effectively be used as an AutoML framework that automates most of the decisions that have to be made in the design process of a machine learning solution for a particular domain (e.g. for product name recognition).
Moreover, we conduct a set of experiments and, based on the results, answer the research questions of this thesis. In particular, we determine (1) which learning models are more effective for our task, (2) which feature groups contain the most relevant features, (3) what the contribution of different feature groups is to the overall performance of the induced model, (4) how gazetteer-based features are incorporated with the machine learning solutions, (5) how effective gazetteer-based features are, (6) what the role of hyper-parameter optimization is, and (7) which models are more sensitive to hyper-parameter optimization.
According to our results, the solutions with maximum and minimum performance are non-linear SVC with an F1 measure of 65% and AdaBoost with an F1 measure of 59%, respectively. This suggests that the choice of classifier has only a modest effect on the final outcome of the learning model, at least on the studied dataset. Additionally, our results show that the most effective feature group is the document-level features, with a 14.8% contribution to the overall performance (i.e. F1 measure). In second position is the group of token-level features, with a 6.8% contribution. The other two groups, the gazetteer-based features and frequency-based features, have small contributions of 1% and 0.5%, respectively. However, further investigation relates the poor performance of gazetteer-based features to the low coverage of the gazetteer used (i.e. ETIM).
Our experiments also show that all learning models over-fit the training data when a large number of features is used; thus, the use of feature selection techniques is essential to the robustness of the proposed solutions. Among the studied learning models, the performance of the non-linear SVC and AdaBoost models strongly depends on the hyper-parameters used. Therefore, for those models the computational cost of hyper-parameter tuning is justifiable.
To celebrate Peter Apers' retirement, we created The Apers Tree, which displays the Academic Genealogy of Peter Apers. The tree is inspired by the wonderful Mathematics Genealogy Project and is a gift from the Database Group of the University of Twente on the occasion of Peter's retirement on 16 February 2018.
Check out the Apers Tree on GitHub.
Cross-Domain Authorship Attribution as a Tool for Digital Investigations
by Christel Geurts
On the dark web, sites promoting illegal content are abundant and new sites are constantly created. At the same time, Law Enforcement is working hard to take these sites down and track down the persons involved. Often, after a site is taken down, users change their name and move to a different site. But what if Law Enforcement could track users across sites? Each distinct site or source of information is called a domain. As the domain changes, the context of a message often changes as well, making it challenging to track users simply by the words they use. The aim of this thesis is to develop a system that can link written text of authors in a cross-domain setting. The system was tested on a blog corpus and verified on police data. Tests show that multinomial logistic regression and Support Vector Machines with a linear kernel perform well. Character 3-grams work well as features, and combining multiple feature sets increases performance. Tests show that Logistic Regression models with a combined feature set performed best (accuracy = 0.717, MRR = 0.7785 for 1000 authors in the blog corpus). On the police data, the Logistic Regression model had an accuracy of 0.612 and an MRR of 0.6883 for 521 authors.
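To make the feature type and the two reported measures concrete, here is a minimal sketch of character 3-gram extraction, accuracy and mean reciprocal rank (MRR). It assumes that the attribution model returns, for each text, a list of candidate authors ordered from most to least likely; the function names and the toy rankings are illustrative and not taken from the thesis.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, including spaces and punctuation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def accuracy(rankings, true_authors):
    """Fraction of texts whose top-ranked candidate is the true author."""
    hits = sum(1 for ranked, truth in zip(rankings, true_authors)
               if ranked and ranked[0] == truth)
    return hits / len(true_authors)


def mean_reciprocal_rank(rankings, true_authors):
    """Average of 1/rank of the true author; a missing author contributes 0."""
    total = 0.0
    for ranked, truth in zip(rankings, true_authors):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(true_authors)


# Toy example: three texts by author "a", with hypothetical model rankings.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]
acc = accuracy(rankings, truths)
mrr = mean_reciprocal_rank(rankings, truths)
```

MRR is never lower than accuracy, because it also gives partial credit when the true author appears lower in the ranked list. This is consistent with the reported MRR values being higher than the accuracies.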
The Case of the Dutch Folktale Database
by Iwe Muiser, Mariët Theune, Ruud de Jong, Nigel Smink, Dolf Trieschnigg, Djoerd Hiemstra, and Theo Meder
This paper demonstrates the use of a user-centred design approach for the development of generous interfaces/rich prospect browsers for an online cultural heritage collection, determining its primary user groups and designing different browsing tools to cater to their specific needs. We set out to solve a set of problems faced by many online cultural heritage collections. These problems are lack of accessibility, limited functionalities to explore the collection through browsing, and risk of less known content being overlooked. The object of our study is the Dutch Folktale Database, an online collection of tens of thousands of folktales from the Netherlands. Although this collection was designed as a research commodity for folktale experts, its primary user group consists of casual users from the general public. We present the new interfaces we developed to facilitate browsing and exploration of the collection by both folktale experts and casual users. We focus on the user-centred design approach we adopted to develop interfaces that would fit the users' needs and preferences.
Published in Digital Humanities Quarterly 11(4), 2017
Access the Folktale Database at: http://www.verhalenbank.nl/.
Over the past months, Searsia investigated ways for search engines to provide search advertisements without participating in the large advertisement networks of Google and Facebook and, more importantly, without the need for search engines to track their users.
Slides of the keynote at the 1st International Workshop on LEARning Next gEneration Rankers, LEARNER 2017 on 1 October 2017 in Amsterdam are now available:
Download the paper: Niek Tax, Sander Bockting, and Djoerd Hiemstra. “A cross-benchmark comparison of 87 learning to rank methods”, Information Processing and Management 51(6), Elsevier, pages 757–772, 2015 [download pdf]