An empirical study of the effect of vocabulary size for various tokenization strategies in passage retrieval performance.
by Fausto de Lang
Many interactions between the fields of lexical retrieval and large language models remain underexplored; in particular, there is little research into the use of advanced language-model tokenizers in combination with classical information retrieval mechanisms. This research investigates the effect of vocabulary size for various tokenization strategies on passage retrieval performance. It also provides an overview of the impact of the WordPiece, Byte-Pair Encoding and Unigram tokenization techniques on the MSMARCO passage retrieval task. These techniques are explored both in re-trained tokenizers and in tokenizers trained from scratch. Based on three metrics, this research finds that WordPiece is the best-performing tokenization technique on the MSMARCO passage retrieval task. It also finds that a training vocabulary size of around 10,000 tokens yields the best Recall, while around 320,000 tokens yields the optimal Mean Reciprocal Rank and Normalized Discounted Cumulative Gain scores. Most importantly, the optimum at a relatively small vocabulary size suggests that shorter subwords can benefit the indexing and searching process, up to a certain point. This is a meaningful result, since it means that many applications that use (re-)trained tokenizers in an information retrieval capacity might be improved by tuning the vocabulary size during training. This research has mainly focused on building a bridge between (re-)trainable tokenizers and information retrieval software, while reporting on interesting tunable parameters. Finally, this research recommends that researchers build their own tokenizer from scratch, since doing so forces one to examine the configuration of the underlying processing steps.
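To illustrate the kind of setup the abstract describes, the sketch below trains a WordPiece tokenizer from scratch with a configurable vocabulary size using the Hugging Face `tokenizers` library. This is not the author's exact pipeline; the file names, special tokens, and the 10,000-token setting are illustrative assumptions, with the tokenized output intended to be passed on to a Lucene-style index.

```python
# Minimal sketch: train a WordPiece tokenizer from scratch with a tunable
# vocabulary size (the parameter studied in this thesis). Paths and settings
# are illustrative assumptions, not the author's exact configuration.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build an empty WordPiece model and a simple whitespace pre-tokenizer.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=10_000,  # vocabulary size around the reported Recall optimum
    special_tokens=["[UNK]", "[PAD]"],
)

# Train on the corpus text (e.g. MSMARCO passages exported to plain text;
# the file name is a placeholder).
tokenizer.train(files=["msmarco_passages.txt"], trainer=trainer)
tokenizer.save("wordpiece_10k.json")

# Tokenize a query or passage before handing the terms to the search index.
print(tokenizer.encode("What is the capital of the Netherlands?").tokens)
```

Swapping `WordPiece`/`WordPieceTrainer` for the BPE or Unigram model and trainer classes, and varying `vocab_size`, reproduces the kind of comparison the thesis reports on.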
Defended on 27 June 2023
Git repository at: gitlab.com/tokenization/Lucene