We are looking for a PhD candidate to join the Data Science group at Radboud University for an exciting new project on transfer learning for language modelling with an application for federated search. Transfer learning learns general purpose language models from huge datasets, such as web crawls, and then trains the models further on smaller datasets for a specific task. Transfer learning in NLP has successfully used pre-trained word-embeddings for several tasks. Although the success of word embeddings on search tasks has been limited, recently pre-trained general purpose language representations such as BERT and ELMo have been successful on several search tasks, including question answering tasks and conversational search tasks. Resource descriptions in federated search consist of samples of the full data that are sparser than full resource representations. This raises the question of how to infer vocabulary that is missing from the sampled data. A promising approach comes from transfer learning from pre-trained language representations. An open question is how to effectively and efficiently apply those pre-trained representations and how to adapt them to the domain of federated search. In this project, you will use pre-trained language models, and further train those models for a (federated) search task. You will evaluate the quality of those models as part of international evaluation conferences like the Text Retrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF).
by Somtochukwu Enendu, Johannes Scholtes, Jeroen Smeets, Djoerd Hiemstra, and Mariet Theune
This paper describes the use of sequence labeling methods in predicting the semantic labels of extracted text regions of heterogeneous electronic documents, by utilizing features related to each semantic label. In this study, we construct a novel dataset consisting of real world documents from multiple domains. We test the performance of the methods on the dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the neural network method slightly outperforms the Conditional Random Field method with limited training data available. Regarding generalizability, our experiments show that the inclusion of textual features aids performance improvements.
Presented at The Conference on Natural Language Processing (“Konferenz zur Verarbeitung natürlicher Sprache”, KONVENS) on 9-11 October in Nürnberg, Germany
If you search for something on the internet, you use Google. The fact that this allows the company to learn quite a bit about you is bothering more and more people. Can it be done differently?
Read more on Radboud Recharge.
Visualization recommendation in a natural setting
by Ties de Kock
Data visualization is often the first step in data analysis. However, creating visualizations is hard: it depends on both knowledge about the data and design knowledge. While more and more data is becoming available, appropriate visualizations are needed to explore this data and extract information. Knowledge of design guidelines is needed to create useful visualizations, that are easy to understand and communicate information effectively.
Visualization recommendation systems support an analyst in choosing an appropriate visualization by providing visualizations, generated from design guidelines implemented as (design) rules. Finding these visualizations is a non-convex optimization problem where design rules are often mutually exclusive: For example, on a scatter plot, the axes can often be swapped; however, it is common to have time on the x-axis.
We propose a system where design rules are implemented as hard criteria and heuristics encoded as soft criteria that do not need to be satisfied, that guide the system toward effective chart designs. We implement this approach in a visualization recommendation system named OVERLOOK , modeled as an optimization problem implemented with the Z3 Satisfiability Modulo Theories solver. Solving this multi-objective optimization problem results in a Pareto front of visualizations balancing heuristics, of which the top results were evaluated in a user study using an evaluation scale for the quality of visualizations as well as the low-level component tasks for which they can be used. In evaluation, we did not find a difference in performance between OVERLOOK and a baseline of manually created visualizations for the same datasets.
We demonstrated OVERLOOK, a system that creates visualization prototypes based on formal rules and ranks them using the scores from both hard- and soft criteria. The visualizations from OVERLOOK were evaluated in a user study for quality. We demonstrate that the system can be used in a realistic setting. The results lead to future work on learning weights for partial scores, given a low-level component task, based on the human quality annotations for generated visualizations.
Predicting Semantic Labels of Text Regions in Heterogeneous Document Images
by Somtochukwu Enendu
This MSc thesis describes the use of sequence labeling methods in predicting the semantic labels of extracted text regions of heterogeneous electronic documents, by utilizing features related to each semantic label. In this study, we construct a novel dataset consisting of real world documents from multiple domains. We test the performance of the methods on the dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the Conditional Random Field method is robust, outperforming the neural network when limited training data is available. Regarding generalizability, our experiments show that the inclusion of textual features does not guarantee performance improvements.
by Enno Ruijters, Carlos Budde, Muhammad Nakhaee, Mariëlle Stoelinga, Doina Bucur, Djoerd Hiemstra, and Stefano Schivo
This paper presents FFORT (the Fault tree FOResT): A large, diverse, extendable, and open benchmark suite consisting of fault tree models, together with relevant metadata. Fault trees are a common formalism in reliability engineering, and the FFORT benchmark brings together a large and representative suite of fault tree models. The benchmark provides each fault tree model in standard Galileo format, together with references to its origin, and a textual and/or graphical description of the tree. This includes quantitative information such as failure rates, and the results of quantitative analyses of standard reliability metrics, such as the system reliability, availability and mean time to failure. Thus, the FFORT benchmark provides:(1) Examples of how fault trees are used in various domains; (2) A large class of tree models to evaluate fault tree methods and tools; (3) Results of analyses to compare newly developed methods with the benchmark results. Currently, the benchmark suite contains 202 fault tree models of great diversity in terms of size, type, and application domain. The benchmark offers statistics on several relevant model features, indicating e.g. how often such features occur in the benchmark, as well as search facilities for fault tree models with the desired features. Inaddition to the trees already collected, the website provides a user-friendly submission page, allowing the general public to contribute with more fault trees and/or analysis results with new methods. Thereby, we aim to provide an open-access, representative collection of fault trees at the state of the art in modeling and analysis.
Presented at the 29th European Safety and Reliability Conference (ESREL 2019) in Hannover, Germany
by Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra
The 41st European Conference on Information Retrieval (ECIR) was held in Cologne, Germany, during April 14–18, 2019, and brought together hundreds of researchers from Europe and abroad. The conference was organized by GESIS–Leibniz Institute for the Social Sciences and the University of Duisburg-Essen — in cooperation with the British Computer Society’s Information Retrieval Specialist Group (BCS-IRSG). These proceedings contain the papers, presentations, workshops, and tutorials given during the conference. This year the ECIR 2019 program boasted a variety of novel work from contributors from all around the world and provided new platforms for promoting information retrieval-related (IR) activities from the CLEF Initiative. In total, 365 submissions were fielded across the tracks from 50 different countries.
The final program included 39 full papers (23% acceptance rate), 44 short papers (29% acceptance rate), eight demonstration papers (67% acceptance rate), nine reproducibility full papers (75% acceptance rate), and eight invited CLEF papers. All submissions were peer reviewed by at least three international Program Committee members to ensure that only submissions of the highest quality were included in the final program. As part of the reviewing process we also provided more detailed review forms and guidelines to help reviewers identify common errors in IR experimentation as a way to help ensure consistency and quality across the reviews.
The accepted papers cover the state of the art in IR: evaluation, deep learning, dialogue and conversational approaches, diversity, knowledge graphs, recommender systems, retrieval methods, user behavior, topic modelling, etc., and also include novel application areas beyond traditional text and Web documents such as the processing and retrieval of narrative histories, images, jobs, biodiversity, medical text, and math. The program boasted a high proportion of papers with students as first authors, as well as papers from a variety of universities, research institutes, and commercial organizations.
In addition to the papers, the program also included two keynotes, four tutorials, four workshops, a doctoral consortium, and an industry day. The first keynote was presented by this year’s BCS IRSG Karen Sparck Jones Award winner, Prof. Krisztian Balog, On Entities and Evaluation, and the second keynote was presented by Prof. Markus Strohmaier, On Ranking People. The tutorials covered a range of topics from conducting lab-based experiments and statistical analysis to categorization and deeplearning, while the workshops brought together participants to discuss algorithm selection (AMIR), narrative extraction (Text2Story), Bibliometrics (BIR), as well as social media personalization and search (SoMePeAS). As part of this year’s ECIR we also introduced a new CLEF session to enable CLEF organizers to report on and promote their upcoming tracks. In sum, this added to the success and diversity of ECIR and helped build bridges between communities.
The success of ECIR 2019 would not have been possible without all the help from the team of volunteers and reviewers. We wish to thank all our track chairs for coordinating the different tracks along with the teams of meta-reviewers and reviewers who helped ensure the high quality of the program. We also wish to thank the demo chairs: Christina Lioma and Dagmar Kern; student mentorship chairs: Ahmet Aker and Laura Dietz; doctoral consortium chairs: Ahmet Aker, Dimitar Dimitrov and Zeljko Carevic; workshop chairs: Diane Kelly and Andreas Rauber; tutorial chairs: Guillaume Cabanac and Suzan Verberne; industry chair: Udo Kruschwitz; publicity chair: Ingo Frommholz; and sponsorship chairs: Jochen L. Leidner and Karam Abdulahhad. We would like to thank our webmaster, Sascha Schüller and our local chair, Nina Dietzel along with all the student volunteers who helped to create an excellent online and offline experience for participants and attendees.
Published as: Advances in Information Retrieval. Proceedings of the 41st European Conference on Information Retrieval Research (ECIR), Lecture Notes in Computer Science, volumes 11437 and 11438, Springer, 2019
[Part I] [Part II]
Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records
by Jan Trienes
Unstructured information in electronic health records provide an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a novel dataset consisting of the medical records of 1260 patients among three domains of Dutch healthcare. We test the generalizability across languages and domains for three de-identification methods. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data, and that a state-of-the-art neural architecture outperforms rule-based and feature-based methods when testing on new domains even when limited training data is available.
(written for CS teaching mailing no. 16 of 11 July)
As of 1 July, I will leave the U. Twente after almost 30 years (first as student, then as PhD student, finally as staff member) for a new challenge at the Radboud University in Nijmegen. I am proud to announce that I will join Radboud University’s faculty of science as professor of Federated Search.
I was privileged to teach in a world that changed a lot since I became an assistant professor (in 2001). Today, university-level courses are no longer taught for the privileged few at universities in developed countries. They are now freely available to anyone online via platforms like Coursera, edX, FutureLearn and on social media, such as on YouTube. Over the last 18 years, I tried to stimulate students to find additional study material online. In return I tried to contribute to the online study material by publishing my teaching material for students to use and for colleagues to share (my Canvas courses are still entirely publicly available) and by using novel social media like UT Mastodon (https://mastodon.utwente.nl).
In my years at the UT, I enjoyed promoting critical thinking by letting students actively put theory to practice, instead of letting students passively absorb knowledge. I particularly enjoyed developing the MSc course Managing Big Data with Maarten Fokkinga and Robin Aly (later perfected by Doina Bucur) where students analysed terabytes of data on a large Hadoop cluster. I enjoyed developing the BSc module Data & Information with Klaas Sikkel, Maurice van Keulen and Luís Ferreira Pires, where we let students work in agile teams, including daily stand-ups, sprint review meetings, and sprint backlogs. I also very much liked running the MSc course Information Retrieval with Paul van der Vet, Theo Huibers and Dolf Trieschnigg, where students used open source search engines and actively contributed to our research. Some of that work was published, and in such cases, students presented their work at international workshops or conferences.
Saying goodbye to Twente is harder than I expected. But remember, Nijmegen is close by: Feel free to contact me. As for PhD students, I intend to continue to be an active contributor to the courses of the Dutch research school SIKS: I hope to see you there.
A multilevel analysis of Movember’s fundraising campaigns in 24 countries
by Tijs van den Broek, Ariana Need, Michel Ehrenhard, Anna Priante, and Djoerd Hiemstra
This study examines how the interplay between an online campaign’s network structure and prosocial cultural norms in a country affect charitable giving. We conducted a multilevel analysis that includes Twitter network and aggregated donation data from the 2013 Movember fundraising campaigns in 24 countries during 62 campaign days. Prosocial cultural norms did not affect the relationship between network size and average donations raised, nor did they affect the relationship between network centralization and average donation amount. Prosocial cultural norms did affect the relationship between network density and average donations raised. However, this effect was negative and contrary to our expectation.
Published in Social Networks 58, pages 128-135