2nd Dutch meeting on Clinical NLP

Now that electronic health records are commonly used, the availability of clinical texts is growing. This workshop discusses the automatic analysis of textual clinical health data to advance medical research and improve healthcare-related services. We especially encourage presentations discussing possibilities to share clinical texts, models and tools for clinical natural language processing (NLP). In practice, privacy and legal regulations prevent the free sharing and combination of electronic health records themselves, but de-identified texts, NLP tools and intermediate results may be shared. We hope that sharing will promote cooperation within the Dutch-speaking countries, as well as advance research in Clinical NLP in those countries. Relevant topics include, but are not limited to:

  • Data sets with clinical texts
  • Open source tools for Clinical NLP
  • Information extraction from clinical text
  • Information retrieval for clinical text
  • Adapting standard NLP tools for clinical text
  • De-identification and ways to preserve privacy in clinical data
  • Using medical terminologies and ontologies
  • Annotation schemes and annotation methodology for clinical data
  • Evaluation methods for the clinical domain
  • Text-based clinical prediction models
  • Speech recognition for clinical text

We solicit short presentations (15 to 20 minutes) from researchers covering recent work, including work in progress and work that was recently published in journals and/or conferences in the field, or made available via data and software sharing platforms like Zenodo or GitHub. Please email the title and abstract of your presentation to: www-clinicalnlp@science.ru.nl before 12 October 2021.

More information at: https://clinical-nlp.cs.ru.nl

Fien Ockers graduates on medication annotation using weak supervision

Medication annotation in medical reports using weak supervision

by Fien Ockers

By detecting textual references to medication in the daily reports written in different healthcare institutions, the resulting medication information can be used for research purposes such as detecting commonly occurring adverse events or conducting a comparative study of the effectiveness of different treatments. In this project, four models, a CRF model and three BERT-based models, are used to solve this medication detection task. They are trained not only on a smaller manually annotated training set, but also on two extended training sets that are created using two weak supervision systems, Snorkel and Skweak. It is found that the CRF model and RobBERT are the best performing models, and that performance is consistently higher for models trained on the manually annotated training set than for those trained on the extended training sets. However, performance on the extended training sets does not lag far behind, showing the potential of weak supervision. Future research could either focus on further training a BERT-based tokenizer and model on the medical domain, or on expanding the labelling functions used in the weak supervision systems to improve recall or to generalize to other medication-related entities such as dosages or modes of administration.
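
The weak-supervision idea can be illustrated with a minimal sketch in plain Python (the drug lexicon, the dosage pattern and the Dutch example sentence are invented, and this is not the actual Snorkel or Skweak API): labelling functions vote on each token, and a majority vote aggregates the votes into weak training labels.

```python
import re

ABSTAIN, MED = -1, 1  # label ids: abstain vs. medication mention

# Hypothetical lexicon; a real system would use curated drug lists.
DRUG_LEXICON = {"paracetamol", "oxazepam", "metformine"}

def lf_lexicon(token: str) -> int:
    """Label a token as medication if it appears in the drug lexicon."""
    return MED if token.lower() in DRUG_LEXICON else ABSTAIN

def lf_dosage_context(token: str, next_token: str) -> int:
    """Tokens directly followed by a dosage pattern such as '500mg'
    are likely medication names."""
    return MED if re.fullmatch(r"\d+\s?mg", next_token) else ABSTAIN

def majority_vote(labels: list[int]) -> int:
    """Aggregate labelling-function votes; abstentions are ignored."""
    votes = [l for l in labels if l != ABSTAIN]
    return MED if votes.count(MED) > len(votes) / 2 else ABSTAIN

tokens = ["patient", "kreeg", "paracetamol", "500mg", "toegediend"]
weak_labels = []
for i, tok in enumerate(tokens):
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    weak_labels.append(majority_vote([lf_lexicon(tok), lf_dosage_context(tok, nxt)]))
# Only 'paracetamol' receives a weak MED label; the other tokens abstain.
```

The aggregated weak labels then take the place of manual annotations when training the CRF or BERT-based end models.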

Ismail Güçlü graduates on programmatically generating annotations for clinical data

Programmatically generating annotations for de-identification of clinical data

by Ismail Güçlü

Clinical records may contain protected health information (PHI), which is privacy-sensitive. It is important to annotate and replace PHI in unstructured medical records before the data can be shared for other research purposes. Machine learning models are quick to implement and can achieve competitive results (micro-averaged F1-scores: 0.88 on a Dutch radiology dataset and 0.87 on the English i2b2 dataset). However, to develop machine learning models, we need training data. In this project, we applied weak supervision to annotate and collect training data for de-identification of medical records. It is essential to automate this process, as manual annotation is a laborious and repetitive task. We used two human-annotated datasets, from which we ‘removed’ the gold annotations, to weakly tag PHI instances in medical records, and unified the output labels using two different aggregation models: aggregation at the token level (Snorkel) and sequential labelling (Skweak). The output is then used to train a discriminative end model, with which we achieve competitive results on the Dutch dataset (micro-averaged F1-score: 0.76), whereas performance on the English dataset is sub-optimal (micro-averaged F1-score: 0.49). The results indicate that on structured PHI tags we approach human-annotated results, but more complicated entities still need more attention.
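
The difference between the two aggregation strategies shows up at the tag level: token-level aggregation treats every token independently, so it can emit BIO sequences that a sequence-aware aggregator (such as the HMM used by Skweak) would never produce. A minimal sketch of the repair such sequences need (entity types and tag sequences here are illustrative):

```python
def repair_bio(tags: list[str]) -> list[str]:
    """Fix BIO sequences that independent token-level aggregation can
    produce, e.g. an I- tag that does not continue a span of the same
    entity type. Sequential aggregation avoids this by modelling tag
    transitions directly."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            ent = tag[2:]
            # An I- tag is only valid after B-/I- of the same entity type.
            if prev not in (f"B-{ent}", f"I-{ent}"):
                tag = f"B-{ent}"
        fixed.append(tag)
        prev = tag
    return fixed

# Token-level voting labelled a name span without its B- tag:
print(repair_bio(["O", "I-NAME", "I-NAME", "O"]))
# → ['O', 'B-NAME', 'I-NAME', 'O']
```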

[more information]

Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records

by Jan Trienes, Dolf Trieschnigg, Christin Seifert, and Djoerd Hiemstra

Unstructured information in electronic health records provides an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from nine institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods, the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their datasets and to enable future benchmarks.

To be presented at the ACM WSDM Health Search and Data Mining Workshop HSDM 2020 on 3 February 2020 in Houston, USA.

[download preprint] [download from arXiv]

Source code is available as deidentify. We aimed to make it easy for others to apply the pre-trained models to new data, so we bundled the code as a Python package which can be installed with pip.

Our paper received the Best paper award!

Jan Trienes graduates cum laude on de-identification of Dutch medical records

Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records

by Jan Trienes

Unstructured information in electronic health records provides an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a novel dataset consisting of the medical records of 1260 patients across three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data, and that a state-of-the-art neural architecture outperforms rule-based and feature-based methods when testing on new domains, even when limited training data is available.

Dolf Trieschnigg defends PhD thesis on Biomedical IR

Proof of Concept: Concept-based Biomedical Information Retrieval

by Dolf Trieschnigg

In this thesis we investigate the possibility of integrating domain-specific knowledge into biomedical information retrieval (IR). Recent decades have shown a fast growing interest in biomedical research, reflected by an exponential growth in scientific literature. Biomedical IR is concerned with the disclosure of these vast amounts of written knowledge. Biomedical IR is not only important for end-users, such as biologists, biochemists, and bioinformaticians searching directly for relevant literature, but also plays an important role in more sophisticated knowledge discovery. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Multiple synonymous terms can be used for a single biomedical concept, such as a gene or disease. Conversely, single terms can be ambiguous and may refer to multiple concepts. Dealing with the terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge into modern word-based information retrieval is, however, far from trivial. This thesis investigates the problem of handling biomedical terminology along three research themes.

The first research theme deals with robust word-based retrieval. Effective retrieval models commonly use a word-based representation for retrieval. As so many spelling variations are present in biomedical text, the way in which these word-based representations are obtained affects retrieval effectiveness. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. This investigation included stop-word removal, stemming, different approaches to breakpoint identification and normalisation, and character n-gramming. In particular, breakpoint identification and normalisation (that is, determining word parts in biomedical compounds) showed a strong effect on retrieval performance. A combination of effective preprocessing heuristics was identified and used to obtain word-based representations from text for the remainder of this thesis.
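
Two of these heuristics can be sketched in a few lines of Python (the regular expression and the n-gram size are illustrative, not the exact heuristics evaluated in the thesis):

```python
import re

def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Character n-grams of a padded word; spelling variants such as
    'tumour' and 'tumor' still share several n-grams."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def normalise_breakpoints(term: str) -> list[str]:
    """Split biomedical compounds at hyphens, slashes, case changes and
    letter-digit boundaries, so that variants like 'IL-2' and 'IL2'
    map to the same word parts."""
    parts = re.split(r"[-/]|(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=\d)", term)
    return [p.lower() for p in parts if p]
```

For example, `normalise_breakpoints("IL-2")` and `normalise_breakpoints("IL2")` both yield `["il", "2"]`, so the two spellings match at retrieval time.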

The second research theme deals with concept-based retrieval. We investigated two representation vocabularies for concept-based indexing, one based on the Medical Subject Headings thesaurus, the other based on the Unified Medical Language System metathesaurus extended with a number of gene and protein dictionaries.

We investigated the following five topics.

  1. How documents are represented in a concept-based representation.
  2. To what extent such a document representation can be obtained automatically.
  3. To what extent a text-based query can be automatically mapped onto a concept-based representation and how this affects retrieval performance.
  4. To what extent a concept-based representation is effective in representing information needs.
  5. How the relationship between text and concepts can be used to determine the relatedness of concepts.

We compared different classification systems to obtain concept-based document and query representations automatically. We proposed two classification methods based on statistical language models, one based on K-Nearest Neighbours (KNN) and one based on Concept Language Models (CLM).
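
The KNN idea can be sketched as follows; cosine similarity over bag-of-words vectors stands in here for the language-model similarity actually used, and the documents and concept labels are made up:

```python
import math
from collections import Counter

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc: dict, train: list, k: int = 2) -> list:
    """Rank concepts for a new document by pooling the manually
    assigned concepts of its k most similar training documents."""
    neighbours = sorted(train, key=lambda dc: cosine(doc, dc[0]), reverse=True)[:k]
    pooled = Counter(c for _, concepts in neighbours for c in concepts)
    return [c for c, _ in pooled.most_common()]

# Made-up training documents with manually assigned concept labels.
train = [
    ({"gene": 1, "p53": 1}, {"Genes"}),
    ({"tumor": 1, "p53": 1}, {"Genes", "Neoplasms"}),
    ({"heart": 1, "rate": 1}, {"Heart"}),
]
concepts = knn_classify({"p53": 1, "tumor": 1}, train)
# 'Genes' is pooled from both nearest neighbours and ranked first.
```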

For a selection of classification systems we carried out a document classification experiment in which we investigated to what extent automatic classification could reproduce manual classification. The proposed KNN system performed well in comparison to the out-of-the-box systems. Manual analysis indicated the improved exhaustiveness of automatic classification over manual classification. Retrieval based on concepts alone was demonstrated to be significantly less effective than word-based retrieval. This deteriorated performance could be explained by errors in the classification process, limitations of the concept vocabularies and limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations did significantly improve over word-only retrieval. In an artificial setting, we compared the optimal retrieval performance that could be obtained with word-based and concept-based representations. Contrary to our intuition, on average a single word-based query performed better than a single concept-based representation, even when the best concept term precisely represented part of the information need.
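
The combination of word-based and concept-based query representations can be illustrated as a simple linear interpolation of retrieval scores (the weights and per-document scores below are invented for illustration; the thesis works with language-modelling scores):

```python
def interpolate(word_score: float, concept_score: float, lam: float = 0.7) -> float:
    """Linearly interpolate word-based and concept-based retrieval
    scores; lam close to 1 keeps the word-based representation dominant."""
    return lam * word_score + (1 - lam) * concept_score

# Invented per-document scores: (word-based score, concept-based score).
scores = {"d1": (0.80, 0.20), "d2": (0.60, 0.90), "d3": (0.40, 0.10)}
ranking = sorted(scores, key=lambda d: interpolate(*scores[d]), reverse=True)
# With lam = 0.7, d2 overtakes d1 thanks to its strong concept score.
```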

We investigated to what extent the relatedness between pairs of concepts as indicated by human judgements could be automatically reproduced. Results on a small test set indicated that a method based on comparing concept language models performed particularly well in comparison to systems based on taxonomy structure, information content and (document) association.

In the third and last research theme of this thesis we propose a framework for concept-based retrieval. We approached the integration of domain knowledge in monolingual information retrieval as a cross-lingual information retrieval (CLIR) problem. Two languages were identified in this monolingual setting: a word-based representation language based on free text, and a concept-based representation language based on a terminological resource. Similar to what is common in traditional CLIR, queries and documents are translated into the same representation language and matched. The cross-lingual perspective gives us the opportunity to adopt a large set of established CLIR methods and techniques for this domain. In analogy to established CLIR practice, we investigated translation models based on a parallel corpus containing documents in multiple representations and translation models based on a thesaurus. Surprisingly, even the integration of very basic translation models showed improvements in retrieval effectiveness over word-only retrieval. A translation model based on pseudo-feedback translation was shown to perform particularly well. We proposed three extensions to a basic cross-lingual retrieval model which, similar to previous approaches in established CLIR, improved retrieval effectiveness by combining multiple translation models. Experimental results indicate that, even when using very basic translation models, monolingual biomedical IR can benefit from a cross-lingual approach to integrate domain knowledge.
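
A minimal sketch of such a translation model (the translation probabilities and concept ids are purely illustrative; in the thesis they are estimated from a parallel corpus of documents available in both representations):

```python
# Illustrative translation probabilities p(concept | word).
P_CONCEPT_GIVEN_WORD = {
    "heart":  {"C_heart": 0.6, "C_mi": 0.4},
    "attack": {"C_mi": 0.7, "C_panic": 0.3},
}

def translate_query(words: list) -> dict:
    """Translate a word-based query into a weighted concept-based query
    by summing translation probabilities over the query words."""
    weights = {}
    for w in words:
        for c, p in P_CONCEPT_GIVEN_WORD.get(w, {}).items():
            weights[c] = weights.get(c, 0.0) + p
    # Normalise so the weights form a distribution over concepts.
    total = sum(weights.values())
    return {c: p / total for c, p in weights.items()}

q = translate_query(["heart", "attack"])
# 'C_mi', supported by both query words, receives the highest weight.
```

The translated concept query can then be matched against concept-based document representations, exactly as a translated query is matched in traditional CLIR.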

[download pdf]

A Cross-lingual Framework for Monolingual Biomedical Information Retrieval

by Dolf Trieschnigg, Djoerd Hiemstra, Franciska de Jong, and Wessel Kraaij

An important challenge for biomedical information retrieval (IR) is dealing with the complex, inconsistent and ambiguous biomedical terminology. Frequently, a concept-based representation defined in terms of a domain-specific terminological resource is employed to deal with this challenge. In this paper, we approach the incorporation of a concept-based representation in monolingual biomedical IR from a cross-lingual perspective. In the proposed framework, this is realized by translating and matching between text and concept-based representations. The approach allows for deployment of a rich set of techniques proposed and evaluated in traditional cross-lingual IR. We compare six translation models and measure their effectiveness in the biomedical domain. We demonstrate that the approach can result in significant improvements in retrieval effectiveness over word-based retrieval. Moreover, we demonstrate increased effectiveness of a cross-lingual IR framework for monolingual biomedical IR if basic translation models are combined.

The paper will be presented at the 19th ACM International Conference on Information and Knowledge Management on October 26-30 in Toronto, Canada.