Fausto de Lang graduates on tokenization for information retrieval

An empirical study of the effect of vocabulary size for various tokenization strategies on passage retrieval performance.

by Fausto de Lang

Many interactions between the fields of lexical retrieval and large language models remain underexplored; in particular, there is little research into the use of advanced language model tokenizers in combination with classical information retrieval mechanisms. This research looks into the effect of vocabulary size on passage retrieval performance for various tokenization strategies. It also provides an overview of the impact of the WordPiece, Byte-Pair Encoding and Unigram tokenization techniques on the MSMARCO passage retrieval task. These techniques are explored both in re-trained tokenizers and in tokenizers trained from scratch. Based on three metrics, this research finds that WordPiece is the best-performing tokenization technique on the MSMARCO passage retrieval task. It also finds that a training vocabulary size of around 10,000 tokens gives the best Recall, while around 320,000 tokens gives the best Mean Reciprocal Rank and Normalized Discounted Cumulative Gain scores. Most importantly, the optimum at a relatively small vocabulary size suggests that shorter subwords can benefit the indexing and searching process (up to a certain point). This is a meaningful result, since it means that many applications where (re-)trained tokenizers are used in an information retrieval capacity might be improved by tweaking the vocabulary size during training. This research has mainly focused on building a bridge between (re-)trainable tokenizers and information retrieval software, while reporting on interesting tunable parameters. Finally, this research recommends that researchers build their own tokenizers from scratch, since doing so forces one to look at the configuration of the underlying processing steps.
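As a rough illustration of the experimental setup such a study requires (a minimal sketch, not necessarily the configuration used in the thesis), a WordPiece tokenizer can be trained from scratch with a chosen vocabulary size using the HuggingFace tokenizers library; the corpus file name below is a placeholder:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Train a WordPiece tokenizer from scratch; vocab_size is the parameter
# whose effect on retrieval performance the thesis investigates.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=10_000, special_tokens=["[UNK]"])
tokenizer.train(files=["msmarco_passages.txt"], trainer=trainer)  # placeholder corpus

# Inspect how a query is split into subword tokens
print(tokenizer.encode("what is the average temperature in taipei").tokens)

The resulting token stream can then be passed on to indexing and searching software such as Lucene, which is where the bridge to information retrieval comes in.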

Defended on 27 June 2023

Git repository at: gitlab.com/tokenization/Lucene

Cross-Market Product-Related Question Answering

by Negin Ghasemi, Mohammad Aliannejadi, Hamed Bonab, Evangelos Kanoulas, Arjen de Vries, James Allan, and Djoerd Hiemstra

Online shops such as Amazon, eBay, and Etsy continue to expand their presence in multiple countries, creating new resource-scarce marketplaces with thousands of items. We consider a marketplace to be resource-scarce when only limited user-generated data is available about the products (e.g., ratings, reviews, and product-related questions). In such a marketplace, an information retrieval system is less likely to help users find answers to their questions about the products. As a result, questions posted online may go unanswered for extended periods. This study investigates the impact of using available data in a resource-rich marketplace to answer new questions in a resource-scarce marketplace, a new problem we call cross-market question answering. To study this problem’s potential impact, we collect and annotate a new dataset, XMarket-QA, from Amazon’s UK (resource-scarce) and US (resource-rich) local marketplaces. We conduct a data analysis to understand the scope of the cross-market question-answering task. This analysis shows a temporal gap of almost one year between the first question answered in the UK marketplace and the US marketplace. Also, it shows that the first question about a product is posted in the UK marketplace only when 28 questions, on average, have already been answered about the same product in the US marketplace. Human annotations demonstrate that, on average, 65% of the questions in the UK marketplace can be answered within the US marketplace, supporting the concept of cross-market question answering. Inspired by these findings, we develop a new method, CMJim, which utilizes product similarities across marketplaces in the training phase for retrieving answers from the resource-rich marketplace that can be used to answer a question in the resource-scarce marketplace. Our evaluations show CMJim’s significant improvement compared to competitive baselines.

To be presented at the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023) on July 23-27 in Taipei, Taiwan.

[download pdf]

BERT for Target Apps Selection

Analyzing the Diversity and Performance of BERT in Unified Mobile Search

by Negin Ghasemi, Mohammad Aliannejadi, and Djoerd Hiemstra

A unified mobile search framework aims to identify the mobile apps that can satisfy a user’s information need and route the user’s query to them. Previous work has shown that resource descriptions for mobile apps are sparse, as they rely on the app’s previous queries. This problem puts certain apps in a dominant position and leaves the resource-scarce apps out of the top ranks. In this case, we need a ranker that goes beyond simple lexical matching. Therefore, our goal is to study the extent of a BERT-based ranker’s ability to improve the quality and diversity of app selection. To this end, we compare the results of the BERT-based ranker with other information retrieval models, focusing on the analysis of the diversification of the selected apps. Our analysis shows that the BERT-based ranker selects more diverse apps while improving the quality of baseline results, by selecting relevant apps such as Facebook and Contacts for more personal queries and decreasing the bias towards dominant resources such as the Google Search app.
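To give a concrete, if simplified, impression of what a ranker "beyond simple lexical matching" looks like, the snippet below scores a query against short textual app descriptions with an off-the-shelf BERT cross-encoder from the sentence-transformers library; the model, the descriptions and the query are illustrative stand-ins, not the setup used in the paper:

from sentence_transformers import CrossEncoder

# Illustrative only: rank apps for a query with a pre-trained BERT cross-encoder
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "send a message to my sister"
apps = {
    "Facebook": "friends messages posts social network",
    "Contacts": "phone numbers and email addresses of people I know",
    "Google Search": "search the web for anything",
}
scores = model.predict([(query, description) for description in apps.values()])
for score, app in sorted(zip(scores, apps), reverse=True):
    print(f"{score:.3f}  {app}")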

[More info]

BERT meets Cranfield

Uncovering the Properties of Full Ranking on Fully Labeled Data

by Negin Ghasemi and Djoerd Hiemstra

Recently, various information retrieval models have been proposed based on pre-trained BERT models, achieving outstanding performance. The majority of such models have been tested on data collections with partial relevance labels, where various potentially relevant documents have not been exposed to the annotators. Therefore, evaluating BERT-based rankers may lead to biased and unfair evaluation results, simply because a relevant document was never shown to the annotators while creating the collection. In our work, we aim to better understand a BERT-based ranker’s strengths compared to a BERT-based re-ranker and the initial ranker. To this end, we investigate the performance of BERT-based rankers on the Cranfield collection, which comes with full relevance judgments on all documents in the collection. Our results demonstrate the BERT-based full ranker’s effectiveness, as opposed to the BERT-based re-ranker and BM25. Our analysis also shows that the BERT-based full ranker finds documents that were not found by the initial ranker.

To be presented at the Student Research Workshop of the Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021) on 22 April 2021.

[download pdf]

Some thoughts on BERT and word pieces

Musings for today’s coffee talk

At SIGIR 2016 in Pisa, Christopher Manning argued that Information Retrieval would be the next field to fully embrace deep neural models. I was sceptical at the time, but by now it is clear that Manning was right: 2018 turned out to bring breakthroughs in deep neural modelling that finally seem to benefit information retrieval systems. Obviously, I am talking about general-purpose language models like ELMo, OpenAI GPT and BERT that allow researchers to use models that are pre-trained on lots of data, and then fine-tune those models to the specific task and domain they are studying. This fine-tuning of models, which is also called Transfer Learning, needs relatively little training data and training time, but produces state-of-the-art results on several tasks. Particularly the application of Google’s BERT has been successful on some fairly general retrieval tasks; Jimmy Lin’s recantation, The neural hype justified!, is a useful article to read for an overview.

BERT (which stands for Bidirectional Encoder Representations from Transformers, see: Pre-training of Deep Bidirectional Transformers for Language Understanding) uses a 12-layer deep neural network that is trained to predict masked parts of a sentence as well as the relationship between sentences, building on the work of Ashish Vaswani and colleagues from 2017 (Attention is all you need). Interestingly, BERT uses a very limited input vocabulary of only 30,000 words or word pieces. If we give BERT the sentence “here is the sentence i want embeddings for.”, it will tokenize it by splitting the word “embeddings” into four word pieces (example from the BERT tutorial by Chris McCormick and Nick Ryan):

['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']
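This behaviour is easy to reproduce; for example, with the HuggingFace transformers library (a small sketch, not taken from the tutorial itself):

from transformers import BertTokenizer

# Load the roughly 30,000 word piece vocabulary of the lower-cased BERT base model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Here is the sentence I want embeddings for."))
# ['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']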

BERT does this presumably for two reasons: 1) to speed up processing and decrease the number of parameters to be trained; and 2) to gracefully handle out-of-vocabulary words, which will occur in unseen data no matter how big a vocabulary the model uses. The word piece models are based on (and successfully used for) Google’s neural machine translation system, which was in turn inspired by Japanese and Korean voice search. The latter approach builds the vocabulary using the following procedure, called Byte Pair Encoding, which was originally developed for data compression:

  1. Initialize the vocabulary with all basic characters of the language (so 52 letters for case-sensitive English and some punctuation, but maybe over 11,000 for Korean);
  2. Build a language model with this vocabulary;
  3. Generate new word pieces by combining pairs of pieces. Add those new pieces to the vocabulary that increase the language model’s likelihood on the training data the most, i.e., pieces that occur a lot consecutively in the training data;
  4. Goto Step 2 unless the maximum vocabulary size is reached.
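A toy version of this loop, following the frequency-based merge rule (pick the most frequent adjacent pair) used by the subword-nmt implementation discussed below, might look as follows; the miniature corpus is made up for illustration:

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count how often each pair of adjacent symbols occurs in the corpus
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair by a single merged symbol
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated sequence of characters with its frequency
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(10):                      # Step 4: in practice, until the maximum vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:                        # stop when every word is a single symbol
        break
    best = max(pairs, key=pairs.get)     # Step 3: the most frequent adjacent pair
    vocab = merge_pair(best, vocab)      # the merged pair becomes a new vocabulary entry
    print("merged", best)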

The approach is described in detail by Rico Sennrich and colleagues (Neural Machine Translation of Rare Words with Subword Units), including the open source implementation subword-nmt. Their solution picks the most frequent pairs in Step 3, which seems to be suboptimal from a language modelling perspective: if the individual pieces occur a lot, the new combined piece might occur frequently by chance. The approach also does not export the frequencies (or probabilities) with the vocabulary. A more principled approach is taken by Taku Kudo (Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates), which comes with the open source implementation sentencepiece. This approach uses a simple unigram model of word pieces, and optimizes its probabilities on the training data using expectation maximization training. The algorithm that finds the optimal vocabulary is less systematic than the byte pair encoding algorithm above: instead, it starts with a big heuristics-based vocabulary and decreases its size during the training process.
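For completeness, here is what training such a unigram word piece model looks like with sentencepiece (a minimal sketch; the corpus file and vocabulary size are placeholders):

import sentencepiece as spm

# Train a unigram language model of word pieces on a plain text corpus
# (one sentence per line); 'corpus.txt' and vocab_size are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="unigram", vocab_size=8000, model_type="unigram"
)

# Load the trained model and segment a sentence into word pieces
sp = spm.SentencePieceProcessor(model_file="unigram.model")
print(sp.encode("here is the sentence i want embeddings for.", out_type=str))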

Of course, word segmentation has always been important for languages that do not use spaces, such as Chinese and Japanese. It has been useful for languages that allow compound nouns too, for instance Dutch and German. However, decreasing the vocabulary size for an English retrieval task seems a counter-intuitive approach, certainly given the amount of work on increasing the vocabulary size by adding phrases. Increased vocabularies for retrieval were evaluated, for instance, by Mandar Mitra et al., Andrew Turpin and Alistair Moffat, and our own Kees Koster and Marc Seuter, but almost always with little success.

As a final thought, I think it is interesting that a successful deep neural model like BERT uses good old statistical NLP for deriving its word piece vocabulary. The word pieces literally form the basis of BERT, but they are not based on a neural network approach themselves. I believe that in the coming years, researchers will start to replace many of the typical neural approaches, like backpropagation and softmax normalization, with more principled approaches like expectation maximization and maximum likelihood estimation. But I haven’t been nearly as often right about the future as Chris Manning, so don’t take my word for it.