for a MSc thesis project on:
Generating synthetic clinical data for shared Machine Learning tasks
Goal: We want to develop methods for researchers to work on shared tasks for which we cannot share the real data because of privacy concerns, in particular clinical data. The envisioned approach is to share synthetic data that is programmatically generated using large-scale language representations like GPT-2 that are fine-tuned to the real data using proper anonymization safe-guards. Additionally, we will research programmatically generating annotations for this data to support shared machine learning and natural language processing tasks using for instance the approaches from Snorkel.
This way researchers and practitioners from different institutions can cooperate on a classification, pseudonimization or tagging task, by working on the synthetic data, possibly using a competitive “Kaggle” approach. Some research questions we want to tackle are:
- Can we generate convincing data? (and how to measure this?)
- Does it prevent private data leakage?
- Can we generate correct annotations of the data?
- How much manual labour is needed, if any?
- Can the synthetic data be used to train AI, and do the trained models work on the real data?
This is a project in cooperation with RUMC, Nedap and Leiden University.
by Jan Trienes, Dolf Trieschnigg, Christin Seifert, and Djoerd Hiemstra
Unstructured information in electronic health records provide an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their datasets and to enable future benchmarks.
To be presented at the ACM WSDM Health Search and Data Mining Workshop HSDM 2020 on 3 February 2020 in Houston, USA.
[download preprint] [download from arXiv]
Source code is available as deidentify. We aimed to make it easy for others to apply the pre-trained models to new data, so we bundled the code as Python package which can be installed with pip.
Our paper received the Best paper award!
by Somtochukwu Enendu, Johannes Scholtes, Jeroen Smeets, Djoerd Hiemstra, and Mariet Theune
This paper describes the use of sequence labeling methods in predicting the semantic labels of extracted text regions of heterogeneous electronic documents, by utilizing features related to each semantic label. In this study, we construct a novel dataset consisting of real world documents from multiple domains. We test the performance of the methods on the dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the neural network method slightly outperforms the Conditional Random Field method with limited training data available. Regarding generalizability, our experiments show that the inclusion of textual features aids performance improvements.
Presented at The Conference on Natural Language Processing (“Konferenz zur Verarbeitung natürlicher Sprache”, KONVENS) on 9-11 October in Nürnberg, Germany
Predicting Semantic Labels of Text Regions in Heterogeneous Document Images
by Somtochukwu Enendu
This MSc thesis describes the use of sequence labeling methods in predicting the semantic labels of extracted text regions of heterogeneous electronic documents, by utilizing features related to each semantic label. In this study, we construct a novel dataset consisting of real world documents from multiple domains. We test the performance of the methods on the dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the Conditional Random Field method is robust, outperforming the neural network when limited training data is available. Regarding generalizability, our experiments show that the inclusion of textual features does not guarantee performance improvements.
Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records
by Jan Trienes
Unstructured information in electronic health records provide an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a novel dataset consisting of the medical records of 1260 patients among three domains of Dutch healthcare. We test the generalizability across languages and domains for three de-identification methods. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data, and that a state-of-the-art neural architecture outperforms rule-based and feature-based methods when testing on new domains even when limited training data is available.
Information Retrieval by Semantically Grouping Search Query Data
by Wim Florijn
Query data analysis is a time-consuming task. Currently, a method exists where word (combinations) in queries are labelled by using an information collection consisting of regular expressions. Because the information collection does not contain regular expressions from never-before seen domains, the method heavily relies on manual work, resulting in decreased scalibility. Therefore, a machine-learning based method is proposed in order to automate the annotation of word (combinations) in queries. This research searches for the optimal configuration of a pre-processing method, word embedding model, additional data set and classifier variant. All configurations have been examined on multiple data sets, and appropriate performance metrics have been calculated. The results show that the optimal configuration consists of omitting pre-processing, training a fastText model and enriching word features using additional data in combination with a recurrent classifier. We found that an approach using machine learning is able to obtain excellent performance on the task of labelling word (combinations) in search queries.
Logical Structure Extraction of Electronic Documents Using Contextual Information
by Semere Bitew
Logical document structure extraction refers to the process of coupling the semantic meanings (logical labels) such as title, authors, affiliation, etc., to physical sections in a document. For example, in scientific papers the first paragraph is usually a title. Logical document structure extraction is a challenging natural language processing problem. Elsevier, as one of the biggest scientific publishers in the world, is working on recovering logical structure from article submissions in its project called the Apollo project. The current process in this project requires the involvement of human annotators to make sure logical entities in articles are labelled with correct tags, such as title, abstract, heading, reference-item and so on. This process can be more efficient in producing correct tags and in providing high quality and consistent publishable article papers if it is automated. A lot of research has been done to automatically extract the logical structure of documents. In this thesis, a document is defined as a sequence of paragraphs and recovering the labels for each paragraph yields the logical structure of a document. For this purpose, we proposed a novel approach that combines random forests with conditional random fields (RF-CRFs) and long short-term memory with CRFs (LSTM-CRFs). Two variants of CRFs called linear-chain CRFs (LCRFs) and dynamic CRFs (DCRFs) are used in both of the proposed approaches. These approaches consider the label information of surrounding paragraphs when classifying paragraphs. Three categories of features namely, textual, linguistic and markup features are extracted to build the RF-CRF models. A word embedding is used as an input to build the LSTM-CRF models. Our models were evaluated for extracting reference-items on Elsevier’s Apollo dataset of 146,333 paragraphs. Our results show that LSTM-CRF models trained on the dataset outperform the RF-CRF models and existing approaches. We show that the LSTM component efficiently uses past feature inputs within a paragraph. The CRF component is able to exploit the contextual information using the tag information of surrounding paragraphs. It was observed that the feature categories are complementary. They produce the best performance when all the features are used. On the other hand, this manual feature extraction can be replaced with an LSTM, where no handcrafted features are used, achieving a better performance. Additionally, the inclusion of features generated for the previous and next paragraph as part of the feature vector for classifying the current paragraph improved the performance of all the models.
by Jordy Michorius.
In this research an approach for bias reduction, while still maintaining usability of the classifier, is proposed. The approach for bias reduction requires all preprocessing to be done, include one-hot encoding and making the training and test set split. The approach then requires a banned feature, a feature that has for example been deemed morally irrelevant for the classification purpose. For the bias reduction, the proposal is to use the KS-score obtained from the two sample KS-test to determine how well a feature contributes to classification and how well it contributes to the bias of the banned feature. So that means that all features present in the dataset that are not the label(L) or the banned feature(B), that the following holds for feature X to be safe to use in the training dataset:
KS–score(X|L=1, X|L=0) > KS–score(X|B=1, X|B=0)
After all features are checked, the unsafe (or flagged) features need to be removed from both the training and the test set in order to make the classifier as fair as possible. The datasets that have been used are the Titanic dataset, with as banned feature the passenger class and a Financial survey, with as banned feature the race. The results have shown that the overall bias has been reduced for both the Titanic dataset and the Financial survey. However in terms of relative fairness, the Financial survey is the only one that became less fair for a certain banned feature value (Race = White). All other values became fairer for both the Financial survey and the Titanic dataset.
Automatic Product Name Recognition from Short Product Descriptions
by Elnaz Pazhouhi
This thesis studies the problem of product name recognition from short product descriptions. This is an important problem especially with the increasing use of ERP (Enterprise Resource Planning) software at the core of modern business management systems, where the information of business transactions is stored in unstructured data stores. A solution to the problem of product name recognition is especially useful for the intermediate businesses as they are interested in finding potential matches between the items in product catalogs (produced by manufactures or another intermediate business) and items in the product requests (given by the end user or another intermediate business).
In this context the problem of product name recognition in specifically challenging because product descriptions are typically short, ungrammatical, incomplete, abbreviated and multilingual. In this thesis we investigate the application of supervised machine-learning techniques and gazetteer-based techniques to our problem. To approach the problem, we define it as a classification problem where the tokens of product descriptions are classified into I, O and B classes according to the standard IOB tagging scheme. Next we investigate and compare the performance of a set of hybrid solutions that combine machine learning and gazetteer-based approaches. We study a solution space that uses four learning models: linear and non-linear SVC, Random Forest, and AdaBoost. For each solution, we use the same set of features. We divide the features into four categories: token-level features, document-level features, gazetteer-based features and frequency-based features. Moreover, we use automatic feature selection to reduce the dimensionality of data; that consequently improves the training efficiency and avoids over-fitting.
To be able to evaluate the solutions, we develop a machine learning framework that takes as its inputs a list of predefined solutions (i.e. our solution space) and a preprocessed labeled dataset (i.e. a feature vector X, and a corresponding class label vector Y). It automatically selects the optimal number of most relevant features, optimizes the hyper-parameters of the learning models, trains the learning models, and evaluates the solution set. We believe that our automated machine learning framework, can effectively be used as an AutoML framework that automates most of the decisions that have to be made in the design process of a machine learning solution for a particular domain (e.g. for product name recognition).
Moreover, we conduct a set of experiments and based on the results, we answer the research questions of this thesis. In particular, we determine (1) which learning models are more effective for our task, (2) which feature groups contain the most relevant features (3) what is the contribution of different feature groups to the overall performance of the induced model, (4) how gazetteer-based features are incorporated with the machine learning solutions, (5) how effective gazetteer-based features are, (6) what the role of hyper-parameter optimization is and (7) which models are more sensitive to the hyper-parameters optimization.
According to our results, the solutions with maximum and minimum performance are non-linear SVC with an F1 measure of 65% and AdaBoost with an F1 measure of 59% respectively. This reveals that the role of classifiers is not considerable in the final outcome of the learning model, at least according to the studied dataset. Additionally, our results show that the most effective feature group is the document-level features with 14.8% contribution to the overall performance (i.e. F1 measure), in the second position, there is the group of token-level features, with 6.8% contribution. The other two groups, the gazetteer-based features and frequency-based features have small contributions of 1% and 0.5% respectively. However more investigations relate the poor performance of gazetteer-based features to the low coverage of the used gazetteer (i.e. ETIM).
Our experiments also show that all learning models over-fit the training data when a large number of features is used; thus the use of feature selection techniques is essential to the robustness of the proposed solutions. Among the studied learning models, the performance of non-linear SVC and AdaBoost models strongly depends on the used hyper-parameters. Therefore for those models the computational cost of the hyper-parameters tuning is justifiable.
Slides of the keynote at the 1st International Workshop on LEARning Next gEneration Rankers, LEARNER 2017 on 1 October 2017 in Amsterdam are now available:
Download the paper: Niek Tax, Sander Bockting, and Djoerd Hiemstra. “A cross-benchmark comparison of 87 learning to rank methods'’, Information Processing and Management 51(6), Elsevier, pages 757–772, 2015 [download pdf]