Reducing Misinformation in Query Autocompletions

by Djoerd Hiemstra

Query autocompletions help users of search engines to speed up their searches by recommending completions of partially typed queries in a drop-down box. These recommended query autocompletions are usually based on large logs of queries that were previously entered by the search engine’s users. Therefore, misinformation that is entered accidentally or purposely to manipulate the search engine might end up in the search engine’s recommendations, potentially harming organizations, individuals, and groups of people. This paper proposes an alternative approach for generating query autocompletions by extracting anchor texts from a large web crawl, without the need to use query logs. Our evaluation shows that even though query log autocompletions perform better for shorter queries, anchor text autocompletions outperform query log autocompletions for queries of two words or more.
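As a toy illustration of the idea (a sketch only, not the paper’s pipeline; the two-page “crawl” and the helper names are made up), anchor texts can be harvested from crawled HTML pages and ranked by frequency to complete a typed prefix:

    from collections import Counter
    from bs4 import BeautifulSoup

    def anchor_texts(html):
        """Extract the anchor texts (link labels) from one crawled page."""
        soup = BeautifulSoup(html, "html.parser")
        return [a.get_text(" ", strip=True).lower() for a in soup.find_all("a")]

    # Aggregate anchor texts over a tiny, made-up crawl of two pages.
    crawl = [
        "<a href='/1'>open search technology</a> <a href='/2'>open access publishing</a>",
        "<a href='/3'>open search technology</a>",
    ]
    counts = Counter(text for page in crawl for text in anchor_texts(page) if text)

    def autocomplete(prefix, k=5):
        """Suggest the k most frequent anchor texts extending the typed prefix."""
        prefix = prefix.lower()
        return [t for t, _ in counts.most_common() if t.startswith(prefix)][:k]

    print(autocomplete("open s"))  # ['open search technology']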

To be presented at the 2nd International Symposium on Open Search Technology (OSSYM 2020), 12-14 October 2020, CERN, Geneva, Switzerland.

[download pdf] [slides]

Transitioning the Information Retrieval Literature to a Fully Open Access Model

by Djoerd Hiemstra, Marie-Francine Moens, Raffaele Perego, and Fabrizio Sebastiani

Almost all of the important literature on Information Retrieval (IR) is published in subscription-based journals and digital libraries. We argue that the lack of open access publishing in IR is seriously hampering progress and inclusiveness of the field. We propose that the IR community start working on a road map for transitioning the IR literature to a fully open access (“diamond”) model.

Published in SIGIR Forum 54(1).

[download preprint]

WANTED: MSc students in Data Science or AI

for an MSc thesis project on:

Generating synthetic clinical data for shared Machine Learning tasks

Goal: We want to develop methods for researchers to work on shared tasks for which we cannot share the real data because of privacy concerns, in particular clinical data. The envisioned approach is to share synthetic data that is programmatically generated using large-scale language models like GPT-2, fine-tuned on the real data with proper anonymization safeguards. Additionally, we will research programmatically generating annotations for this data to support shared machine learning and natural language processing tasks, using, for instance, approaches like Snorkel.
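As a rough sketch of the generation step (using the Hugging Face transformers library; the prompt, the sampling settings, and the idea of loading a clinically fine-tuned checkpoint are illustrative assumptions, not project code):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # In the project this would be a GPT-2 model fine-tuned on properly
    # anonymized clinical notes; here we load the generic pre-trained model.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Sample a few synthetic continuations of a clinical-style prompt.
    prompt = "Patient presents with"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs, do_sample=True, top_p=0.9, max_length=60,
        num_return_sequences=3, pad_token_id=tokenizer.eos_token_id)
    for sequence in outputs:
        print(tokenizer.decode(sequence, skip_special_tokens=True))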

This way, researchers and practitioners from different institutions can cooperate on a classification, pseudonymization, or tagging task by working on the synthetic data, possibly in a competitive “Kaggle”-style setting. Some research questions we want to tackle are:

  1. Can we generate convincing data? (and how to measure this?)
  2. Does the synthetic data prevent leakage of private information?
  3. Can we generate correct annotations of the data?
  4. How much manual labour is needed, if any?
  5. Can the synthetic data be used to train AI, and do the trained models work on the real data?

This is a project in cooperation with RUMC, Nedap and Leiden University.

Professor positions in Machine Learning for Data Science

The Data Science section of Radboud University seeks to appoint an Assistant Professor and an Associate Professor in Machine Learning for Data Science. Deadline: 31 March.

To strengthen and expand the Data Science section’s research, we seek to appoint an Assistant Professor and an Associate Professor in Machine Learning for Data Science. These positions will also be pivotal in supporting our Bachelor’s programme and our Data Science Master’s specialisations, in particular the Master’s courses that attract many students. The main goal of Machine Learning for Data Science is to develop machine learning approaches and techniques that are broadly applicable beyond a specific application domain. It involves the study, development, and application of machine learning techniques to tackle real-life problems with challenging learning tasks and/or types of data.

[More information]

NoGA: No Google Analytics

We will start a project to investigate alternatives to the web analytics services of Big Tech companies (Google), as part of the SIDN fonds program Je data de baas (“Mastering your data”). There are 10 projects in the program.

In this project, the Data Science group works together with the Marketing and ICT departments of the university to find out whether it is a serious option for Radboud University (and other medium-sized to large organizations such as municipalities and hospitals) to take web analytics into their own hands, instead of outsourcing it to Google.

More information at: https://nogadata.nl

Some thoughts on BERT and word pieces

Musings for today’s coffee talk

At SIGIR 2016 in Pisa, Christopher Manning argued that Information Retrieval would be the next field to fully embrace deep neural models. I was sceptical at the time, but by now it is clear that Manning was right: 2018 turned out to bring breakthroughs in deep neural modelling that finally seem to benefit information retrieval systems. Obviously, I am talking about general purpose language models like ELMo, OpenAI GPT and BERT that allow researchers to use models that are pre-trained on lots of data, and then fine-tune those models to the specific task and domain they are studying. This fine-tuning of models, which is also called Transfer Learning, needs relatively little training data and training time, but produces state-of-the-art results on several tasks. Particularly the application of Google’s BERT has been successful on some fairly general retrieval tasks; Jimmy Lin’s recantation: The neural hype justified! is a useful article to read for an overview.

BERT (which stands for Bidirectional Encoder Representations from Transformers, see: Pre-training of Deep Bidirectional Transformers for Language Understanding) uses a 12-layer deep neural network that is trained to predict masked parts of a sentence as well as the relationship between sentences, building on the work of Ashish Vaswani and colleagues from 2017 (Attention is all you need). Interestingly, BERT uses a very limited input vocabulary of only 30,000 words or word pieces. If we give BERT the sentence "here is the sentence i want embeddings for.", it will tokenize it by splitting the word "embeddings" into four word pieces (example from the BERT tutorial by Chris McCormick and Nick Ryan):

['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']
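This tokenization is easy to reproduce with the pre-trained uncased BERT tokenizer from the Hugging Face transformers library (a minimal sketch, not part of the tutorial code):

    from transformers import BertTokenizer

    # Load the WordPiece vocabulary that ships with the uncased BERT base model.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    print(tokenizer.tokenize("here is the sentence i want embeddings for."))
    # ['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']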

BERT does this presumably for two reasons: 1) to speed up processing and decrease the number of parameters to be trained; and 2) to gracefully handle out-of-vocabulary words, which will occur in unseen data no matter how big a vocabulary the model uses. The word piece models are based on (and successfully used for) Google’s neural machine translation system, which in turn was inspired by Google’s Japanese and Korean voice search. The latter approach builds the vocabulary using the following procedure, called Byte Pair Encoding, which was originally developed for data compression (a toy implementation is sketched after the list):

  1. Initialize the vocabulary with all basic characters of the language (so 52 letters for case-sensitive English and some punctuation, but maybe over 11,000 for Korean);
  2. Build a language model with this vocabulary;
  3. Generate new word pieces by combining pairs of pieces. Add those new pieces to the vocabulary that increase the language model’s likelihood on the training data the most, i.e., pieces that occur a lot consecutively in the training data;
  4. Go to Step 2 unless the maximum vocabulary size is reached.
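The procedure can be made concrete with a small toy implementation (a sketch of the frequency-based variant of Sennrich and colleagues discussed below, using the example words from their paper; it is not the subword-nmt code itself):

    import re
    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs, weighted by word frequency."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Merge every occurrence of the pair into a single new symbol."""
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Step 1: start from single characters (words written as space-separated symbols).
    corpus = Counter({"low": 5, "lower": 2, "newest": 6, "widest": 3})
    vocab = {" ".join(word): freq for word, freq in corpus.items()}

    # Steps 2-4: repeatedly add the most frequent pair as a new word piece.
    for _ in range(10):  # a real vocabulary uses tens of thousands of merges
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print("new word piece:", "".join(best))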

The approach is described in detail by Rico Sennrich and colleagues (Neural Machine Translation of Rare Words with Subword Units), including the open source implementation subword-nmt. Their solution picks the most frequent pairs in Step 3, which seems to be suboptimal from a language modelling perspective: if the individual pieces occur a lot, the new combined piece might occur frequently by chance. The approach also does not export the frequencies (or probabilities) with the vocabulary. A more principled approach is taken by Taku Kudo (Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates), which comes with the open source implementation sentencepiece. This approach uses a simple unigram model of word pieces, and optimizes its probabilities on the training data using expectation maximization training. The algorithm that finds the optimal vocabulary is less systematic than the byte pair encoding algorithm above, instead starting with a big heuristics-based vocabulary and decreasing its size during the training process.
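For completeness, training such a unigram word piece model takes only a few lines with the sentencepiece Python package (a sketch; the corpus file name and the vocabulary size are placeholders):

    import sentencepiece as spm

    # Train a unigram word piece model on a plain-text corpus, one sentence per line.
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="unigram_demo",
        vocab_size=8000, model_type="unigram")

    # Load the trained model and segment a sentence into word pieces.
    sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")
    print(sp.encode("here is the sentence i want embeddings for.", out_type=str))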

Of course, word segmentation has always been important for languages that do not use spaces, such as Chinese and Japanese. It has been useful for languages that allow compound nouns too, for instance Dutch and German. However, decreasing the vocabulary size for an English retrieval task seems a counter-intuitive approach, certainly given the amount of work on increasing the vocabulary size by adding phrases. Increased vocabularies for retrieval were for instance evaluated by Mandar Mitra et al., by Andrew Turpin and Alistair Moffat, and by our own Kees Koster and Marc Seuter, but almost always with little success.

As a final thought, I think it is interesting that a successful deep neural model like BERT uses good old statistical NLP for deriving its word piece vocabulary. The word pieces literally form the basis of BERT, but they are not based on a neural network approach themselves. I believe that in the coming years, researchers will start to replace many of the typical neural approaches, like backpropagation and softmax normalization, by more principled approaches like expectation maximization and maximum likelihood estimation. But I have not been right about the future nearly as often as Chris Manning, so don’t take my word for it.

Guest lecture by Arjen de Vries

On Tuesday 17 December at 8:30h in SP-2, prof. Arjen de Vries will give a guest lecture on column-oriented relational database management systems (DBMS). A column-oriented DBMS (or column store) is a DBMS that physically stores tables by column rather than by row. In previous lectures we have been mostly concerned with Online Transaction Processing (OLTP) workloads, with lots of small inserts and lots of queries that each touch only a small part of the data. Column stores, however, are well suited for Online Analytical Processing (OLAP) workloads, which involve complex analytical queries over all data.
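A tiny Python illustration of the difference (not from the lecture material): for an analytical query such as an average over one attribute, a column layout lets the system scan a single column, whereas a row layout forces it to read every complete row.

    # The same toy table stored row-wise and column-wise.
    rows = [
        {"id": 1, "product": "pen",  "price": 1.20},
        {"id": 2, "product": "book", "price": 9.50},
        {"id": 3, "product": "lamp", "price": 24.00},
    ]
    columns = {
        "id":      [1, 2, 3],
        "product": ["pen", "book", "lamp"],
        "price":   [1.20, 9.50, 24.00],
    }

    # OLAP-style query: the average price over all data.
    print(sum(row["price"] for row in rows) / len(rows))    # reads every full row
    print(sum(columns["price"]) / len(columns["price"]))    # reads a single column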

Attendance at this lecture is highly recommended.

Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records

by Jan Trienes, Dolf Trieschnigg, Christin Seifert, and Djoerd Hiemstra

Unstructured information in electronic health records provides an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from nine institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods, the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their datasets and to enable future benchmarks.

To be presented at the ACM WSDM Health Search and Data Mining Workshop HSDM 2020 on 3 February 2020 in Houston, USA.

[download preprint] [download from arXiv]

Source code is available as deidentify. We aimed to make it easy for others to apply the pre-trained models to new data, so we bundled the code as a Python package that can be installed with pip.

Our paper received the Best paper award!