Welcome to the Data Science group, Negin Ghasemi! Negin will work on Transfer Learning for Federated Search.
The Data Science section of the Radboud University seeks to appoint an Assistant Professor and an Associate Professor in Machine Learning for Data Science. Deadline: 31 March.
Musings for today’s coffee talk
At SIGIR 2016 in Pisa, Christopher Manning argued that Information Retrieval would be the next field to fully embrace deep neural models. I was sceptic at the time, but by now it is clear that Manning was right: 2018 turned out to bring breakthroughs in deep neural modelling that finally seem to benefit information retrieval systems. Obviously, I am talking about general purpose language models like ELMO, Open-GPT and BERT that allow researchers to use models that are pre-trained on lots of data, and then fine-tune those models to the specific task and domain they are studying. This fine-tuning of models, which is also called Transfer Learning, needs relatively little training data and training time, but produces state-of-the-art results on several tasks. Particularly the application of Google’s BERT has been successful on some fairly general retrieval tasks; Jimmy Lin’s recantation: The neural hype justified! is a useful article to read for an overview.
BERT (which stands for Bidirectional Encoder Representations from Transformers, see: Pre-training of Deep Bidirectional Transformers for Language Understanding) uses a 12-layer deep neural network that is trained to predict masked parts of a sentence as well as the relationship between sentences using the work of Ashish Vaswani and colleagues from 2017 (Attention is all you need). Interestingly, BERT uses a very limited input vocabulary of only 30,000 words or word pieces. If we give BERT the sentence: “here is the sentence i want embeddings for.” it will tokenize it by splitting the word “embeddings” in four word pieces (example from the BERT tutorial by Chris McCormick and Nick Ryan)
['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']
BERT does this presumably for two reasons: 1) to speed up processing and decrease the number of parameters to be trained; and 2) to gracefully handle out-of-vocabulary words, which will occur in unseen data no matter how big of a vocabulary the model uses. The word piece models are based on (and successfully used for) Google’s neural machine translation system, which was again inspired by Japanese and Korean voice search. The latter approach builds the vocabulary using the following procedure, called Byte Pair Encoding which was developed for data compression.
- Initialize the vocabulary with all basic characters of the language (so 52 letters for case-sensitive English and some punctuation, but maybe over 11,000 for Korean);
- Build a language model with this vocabulary;
- Generate new word pieces by combining pairs of pieces. Add those new pieces to the vocabulary that increase the language model’s likelihood on the training data the most, i.e., pieces that occur a lot consecutively in the training data;
- Goto Step 2 unless the maximum vocabulary size is reached.
The approach is described in detail by Rico Senrich and colleagues (Neural Machine Translation of Rare Words with Subword Units) including the open source implementation subword-nmt. Their solution picks the most frequent pairs in Step 3, which seems to be suboptimal from a language modelling perspective: If the individual pieces occur a lot, the new combined piece might frequently by chance. The approach also does not export the frequencies (of probabilities) with the vocabulary. A more principled approach is taken by Taku Kudo (Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates) which comes with the open source implementation sentencepiece. This approach uses a simple unigram model of word pieces, and optimizes its probabilities on the training data using expectation maximization training. The algorithm that finds the optimal vocabulary is less systematic than the byte pair encoding algorithm above, instead starting with a big heuristics-based vocabulary and decreasing its size during the training process.
Of course, word segmentation has always been important for languages that do not use spaces, such as Chinese and Japanese. It has been useful for languages that allow compound nouns too, for instance Dutch and German. However, decreasing the vocabulary size for an English retrieval task seems a counter-intuitive approach, certainly given the amount of work on increasing the vocabulary size by adding phrases. Increased vocabularies for retrieval were for instances evaluated by Mandar Mitra et al. and Andrew Turpin and Alistair Moffat, and our own Kees Koster and Marc Seuter, but almost always without little success.
As a final thought, I think it is interesting that a successful deep neural model like BERT uses good old statistical NLP for deriving its word piece vocabulary. The word pieces literally form the basis of BERT, but they are not based on a neural network approach themselves. I believe that in the coming years, researchers will start to replace many of the typical neural approaches, like back propagation and softmax normalization by more principled approaches like expectation maximization and maximum likelihood estimation. But I haven’t been nearly as often right about the future as Chris Manning, so don’t take my word for it.
by Jan Trienes, Dolf Trieschnigg, Christin Seifert, and Djoerd Hiemstra
Unstructured information in electronic health records provide an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their datasets and to enable future benchmarks.
To be presented at the ACM WSDM Health Search and Data Mining Workshop HSDM 2020 on 3 February 2020 in Houston, USA.
Source code is available as deidentify. We aimed to make it easy for others to apply the pre-trained models to new data, so we bundled the code as Python package which can be installed with pip.
Our paper received the Best paper award!
The Blind Man and the Elephant: Measuring Economic Impacts of DDoS Attacks
Internet has become an important part of our everyday life. We use services like Netflix, Skype, online banking and scopus etc. daily. We even use Internet for filing our taxes and communicating with municipality. This dependency on network-based technologies also provides an opportunity to malicious actors in our society to remotely attack IT infrastructure. One such cyberattack that may lead to unavailability of network resources is known as distributed denial of service (DDoS) attack. A DDoS attack leverages many computers to launch a coordinated Denial of Service attack against one or more targets.
These attacks cause damages to victim businesses. According to reports published by several consultancies and security companies these attacks lead to millions of dollars in losses every year. One might ponder: are the damages caused by temporary unavailability of network services really this large? One of the points of criticism for these reports has been that they often base their findings on victim surveys and expert opinions. Now, as cost accounting/book keeping methods are not focused on measuring the impact of cyber security incidents, it is highly likely that surveys are unable to capture the true impact of an attack. A concerning fact is that most C-level managers make budgetary decisions for security based on the losses reported in these surveys. Several inputs for security investment decision models such as return on security investment (ROSI) also depend on these figures. This makes the situation very similar to the parable of the blind men and the elephant, who try to conceptualise how the elephant looks like by touching it. Hence, it is important to develop methodologies that capture the true impact of DDoS attacks. In this thesis, we study the economic impact of DDoS attacks on public/private organisations by using an empirical approach.
We are looking for a PhD candidate to join the Data Science group at Radboud University for an exciting new project on transfer learning for language modelling with an application for federated search. Transfer learning learns general purpose language models from huge datasets, such as web crawls, and then trains the models further on smaller datasets for a specific task. Transfer learning in NLP has successfully used pre-trained word-embeddings for several tasks. Although the success of word embeddings on search tasks has been limited, recently pre-trained general purpose language representations such as BERT and ELMo have been successful on several search tasks, including question answering tasks and conversational search tasks. Resource descriptions in federated search consist of samples of the full data that are sparser than full resource representations. This raises the question of how to infer vocabulary that is missing from the sampled data. A promising approach comes from transfer learning from pre-trained language representations. An open question is how to effectively and efficiently apply those pre-trained representations and how to adapt them to the domain of federated search. In this project, you will use pre-trained language models, and further train those models for a (federated) search task. You will evaluate the quality of those models as part of international evaluation conferences like the Text Retrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF).
by Somtochukwu Enendu, Johannes Scholtes, Jeroen Smeets, Djoerd Hiemstra, and Mariet Theune
This paper describes the use of sequence labeling methods in predicting the semantic labels of extracted text regions of heterogeneous electronic documents, by utilizing features related to each semantic label. In this study, we construct a novel dataset consisting of real world documents from multiple domains. We test the performance of the methods on the dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the neural network method slightly outperforms the Conditional Random Field method with limited training data available. Regarding generalizability, our experiments show that the inclusion of textual features aids performance improvements.
Presented at The Conference on Natural Language Processing (“Konferenz zur Verarbeitung natürlicher Sprache”, KONVENS) on 9-11 October in Nürnberg, Germany
If you search for something on the internet, you use Google. The fact that this allows the company to learn quite a bit about you is bothering more and more people. Can it be done differently?
Read more on Radboud Recharge.
Visualization recommendation in a natural setting
by Ties de Kock
Data visualization is often the first step in data analysis. However, creating visualizations is hard: it depends on both knowledge about the data and design knowledge. While more and more data is becoming available, appropriate visualizations are needed to explore this data and extract information. Knowledge of design guidelines is needed to create useful visualizations, that are easy to understand and communicate information effectively.
Visualization recommendation systems support an analyst in choosing an appropriate visualization by providing visualizations, generated from design guidelines implemented as (design) rules. Finding these visualizations is a non-convex optimization problem where design rules are often mutually exclusive: For example, on a scatter plot, the axes can often be swapped; however, it is common to have time on the x-axis.
We propose a system where design rules are implemented as hard criteria and heuristics encoded as soft criteria that do not need to be satisfied, that guide the system toward effective chart designs. We implement this approach in a visualization recommendation system named OVERLOOK , modeled as an optimization problem implemented with the Z3 Satisfiability Modulo Theories solver. Solving this multi-objective optimization problem results in a Pareto front of visualizations balancing heuristics, of which the top results were evaluated in a user study using an evaluation scale for the quality of visualizations as well as the low-level component tasks for which they can be used. In evaluation, we did not find a difference in performance between OVERLOOK and a baseline of manually created visualizations for the same datasets.
We demonstrated OVERLOOK, a system that creates visualization prototypes based on formal rules and ranks them using the scores from both hard- and soft criteria. The visualizations from OVERLOOK were evaluated in a user study for quality. We demonstrate that the system can be used in a realistic setting. The results lead to future work on learning weights for partial scores, given a low-level component task, based on the human quality annotations for generated visualizations.
Predicting Semantic Labels of Text Regions in Heterogeneous Document Images
by Somtochukwu Enendu
This MSc thesis describes the use of sequence labeling methods in predicting the semantic labels of extracted text regions of heterogeneous electronic documents, by utilizing features related to each semantic label. In this study, we construct a novel dataset consisting of real world documents from multiple domains. We test the performance of the methods on the dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the Conditional Random Field method is robust, outperforming the neural network when limited training data is available. Regarding generalizability, our experiments show that the inclusion of textual features does not guarantee performance improvements.