Some thoughts on BERT and word pieces

Musings for today’s coffee talk

At SIGIR 2016 in Pisa, Christopher Manning argued that Information Retrieval would be the next field to fully embrace deep neural models. I was sceptical at the time, but by now it is clear that Manning was right: 2018 turned out to bring breakthroughs in deep neural modelling that finally seem to benefit information retrieval systems. Obviously, I am talking about general purpose language models like ELMo, OpenAI GPT and BERT that allow researchers to use models that are pre-trained on lots of data, and then fine-tune those models to the specific task and domain they are studying. This fine-tuning of models, which is also called transfer learning, needs relatively little training data and training time, but produces state-of-the-art results on several tasks. The application of Google’s BERT in particular has been successful on some fairly general retrieval tasks; Jimmy Lin’s recantation: The neural hype justified! is a useful article to read for an overview.

BERT (which stands for Bidirectional Encoder Representations from Transformers, see: Pre-training of Deep Bidirectional Transformers for Language Understanding) uses a 12-layer deep neural network that is trained to predict masked parts of a sentence as well as the relationship between sentences, building on the Transformer work of Ashish Vaswani and colleagues from 2017 (Attention is all you need). Interestingly, BERT uses a very limited input vocabulary of only 30,000 words or word pieces. If we give BERT the sentence: “here is the sentence i want embeddings for.” it will tokenize it by splitting the word “embeddings” into four word pieces (example from the BERT tutorial by Chris McCormick and Nick Ryan):

['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']
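
To make the splitting concrete, here is a minimal sketch of greedy longest-match-first matching, which is how BERT's reference tokenizer applies its vocabulary to a single word. The function name `wordpiece_tokenize` and the twelve-entry `toy_vocab` are made up for illustration, and the sketch assumes that punctuation has already been separated from the words by whitespace.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one has about 30,000 entries.
toy_vocab = {"here", "is", "the", "sentence", "i", "want", "for", ".",
             "em", "##bed", "##ding", "##s"}

# Assumption: punctuation is already split off (note the space before ".").
sentence = "here is the sentence i want embeddings for ."
tokens = [p for w in sentence.split() for p in wordpiece_tokenize(w, toy_vocab)]
print(tokens)
```

With this toy vocabulary the sketch reproduces the twelve tokens listed above. A word for which no piece matches at some position is mapped to a single unknown token.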

BERT presumably does this for two reasons: 1) to speed up processing and decrease the number of parameters to be trained; and 2) to gracefully handle out-of-vocabulary words, which will occur in unseen data no matter how big a vocabulary the model uses. The word piece models are based on (and successfully used for) Google’s neural machine translation system, which was in turn inspired by work on Japanese and Korean voice search. The latter approach builds the vocabulary using the following procedure, called Byte Pair Encoding, which was originally developed for data compression:

  1. Initialize the vocabulary with all basic characters of the language (so 52 letters for case-sensitive English and some punctuation, but maybe over 11,000 for Korean);
  2. Build a language model with this vocabulary;
  3. Generate new word pieces by combining pairs of pieces. Add those new pieces to the vocabulary that increase the language model’s likelihood on the training data the most, i.e., pieces that occur a lot consecutively in the training data;
  4. Go to Step 2 unless the maximum vocabulary size is reached.
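
The loop above can be sketched in a few lines of Python. Note the simplification: instead of building a language model and measuring likelihood gain, this sketch simply merges the most frequent adjacent pair, so raw pair counts stand in for Steps 2 and 3. The function names and the tiny word-frequency table are invented for illustration.

```python
from collections import Counter


def count_pairs(corpus):
    """Count adjacent piece pairs, weighted by word frequency."""
    pairs = Counter()
    for pieces, freq in corpus.items():
        for left, right in zip(pieces, pieces[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(corpus, pair):
    """Rewrite every word so that the chosen pair becomes a single piece."""
    merged = {}
    for pieces, freq in corpus.items():
        out, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
                out.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                out.append(pieces[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_vocabulary(word_freqs, max_size):
    # Step 1: start from the single characters of every word.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = {ch for pieces in corpus for ch in pieces}
    while len(vocab) < max_size:          # Step 4: stop at the size limit
        pairs = count_pairs(corpus)       # Step 2: counts replace the language model
        if not pairs:
            break                         # every word is already a single piece
        best = max(pairs, key=pairs.get)  # Step 3: most frequent pair wins
        corpus = merge_pair(corpus, best)
        vocab.add(best[0] + best[1])
    return vocab


# Hypothetical word frequencies, for illustration only.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = learn_vocabulary(word_freqs, max_size=12)
print(sorted(vocab))
```

The toy data contains ten distinct characters, so a maximum size of 12 leaves room for exactly two merged pieces.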

The approach is described in detail by Rico Sennrich and colleagues (Neural Machine Translation of Rare Words with Subword Units) including the open source implementation subword-nmt. Their solution picks the most frequent pairs in Step 3, which seems to be suboptimal from a language modelling perspective: if the individual pieces occur a lot, the new combined piece might occur frequently by chance. The approach also does not export the frequencies (or probabilities) with the vocabulary. A more principled approach is taken by Taku Kudo (Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates) which comes with the open source implementation sentencepiece. This approach uses a simple unigram model of word pieces, and optimizes its probabilities on the training data using expectation maximization training. The algorithm that finds the optimal vocabulary is less systematic than the byte pair encoding algorithm above: it starts with a big heuristics-based vocabulary and decreases its size during the training process.
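
To illustrate what a unigram model of word pieces buys you, the sketch below finds the most probable segmentation of a word by dynamic programming (a Viterbi search), given a probability for every piece. It covers only the segmentation step, not the expectation maximization training or the vocabulary pruning. The function name `segment` and all probabilities are made up for illustration and are not taken from sentencepiece.

```python
import math


def segment(word, log_probs):
    """Most probable split of `word` under a unigram model of word pieces."""
    n = len(word)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of word[:i]
    back = [0] * (n + 1)            # back[i]: where the last piece of that split starts
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None  # the word cannot be built from the vocabulary
    # Follow the back pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical piece probabilities, chosen only for illustration.
probs = {"em": 0.05, "bed": 0.04, "ding": 0.03, "s": 0.10, "embed": 0.01}
log_probs = {piece: math.log(p) for piece, p in probs.items()}
print(segment("embeddings", log_probs))
```

With these numbers the split embed + ding + s (probability 0.01 × 0.03 × 0.10) beats em + bed + ding + s (0.05 × 0.04 × 0.03 × 0.10), which shows why the probabilities need to be stored with the vocabulary: the preferred segmentation depends on them, not only on which pieces exist.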

Of course, word segmentation has always been important for languages that do not use spaces, such as Chinese and Japanese. It has been useful for languages that allow compound nouns too, for instance Dutch and German. However, decreasing the vocabulary size for an English retrieval task seems a counter-intuitive approach, certainly given the amount of work on increasing the vocabulary size by adding phrases. Increased vocabularies for retrieval were for instance evaluated by Mandar Mitra et al., by Andrew Turpin and Alistair Moffat, and by our own Kees Koster and Marc Seuter, but almost always with little success.

As a final thought, I think it is interesting that a successful deep neural model like BERT uses good old statistical NLP for deriving its word piece vocabulary. The word pieces literally form the basis of BERT, but they are not based on a neural network approach themselves. I believe that in the coming years, researchers will start to replace many of the typical neural approaches, like back propagation and softmax normalization, with more principled approaches like expectation maximization and maximum likelihood estimation. But I haven’t been right about the future nearly as often as Chris Manning, so don’t take my word for it.
