Some thoughts on BERT and word pieces

Musings for today’s coffee talk

At SIGIR 2016 in Pisa, Christopher Manning argued that Information Retrieval would be the next field to fully embrace deep neural models. I was sceptical at the time, but by now it is clear that Manning was right: 2018 brought breakthroughs in deep neural modelling that finally seem to benefit information retrieval systems. Obviously, I am talking about general-purpose language models like ELMo, OpenAI GPT and BERT, which allow researchers to take models that are pre-trained on lots of data and then fine-tune them to the specific task and domain they are studying. This fine-tuning of models, also called Transfer Learning, needs relatively little training data and training time, yet produces state-of-the-art results on several tasks. The application of Google’s BERT in particular has been successful on some fairly general retrieval tasks; Jimmy Lin’s recantation, The neural hype justified!, is a useful article to read for an overview.

BERT (which stands for Bidirectional Encoder Representations from Transformers, see: Pre-training of Deep Bidirectional Transformers for Language Understanding) uses a 12-layer deep neural network that is trained to predict masked parts of a sentence as well as the relationship between sentences, building on the transformer architecture of Ashish Vaswani and colleagues from 2017 (Attention is all you need). Interestingly, BERT uses a very limited input vocabulary of only 30,000 words or word pieces. If we give BERT the sentence “here is the sentence i want embeddings for.”, it will tokenize it by splitting the word “embeddings” into four word pieces (example from the BERT tutorial by Chris McCormick and Nick Ryan):

['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']
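The splitting itself is a greedy longest-match-first lookup against the fixed vocabulary. The sketch below illustrates the idea with a tiny hypothetical vocabulary (BERT’s real vocabulary has about 30,000 entries, and BERT first runs a basic tokenizer that separates punctuation; here the input is already space-separated):

```python
# Minimal sketch of WordPiece-style tokenization: greedy longest-match-first
# against a fixed vocabulary. The vocabulary below is a hypothetical toy.
VOCAB = {"here", "is", "the", "sentence", "i", "want", "for", ".",
         "em", "##bed", "##ding", "##s"}

def wordpiece(word, vocab=VOCAB):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        start = end
    return pieces

def tokenize(sentence, vocab=VOCAB):
    tokens = []
    for word in sentence.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens

print(tokenize("here is the sentence i want embeddings for ."))
# → ['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']
```

Because every single character can be a fallback piece in the real vocabulary, almost no word ever maps to the unknown token.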

BERT presumably does this for two reasons: 1) to speed up processing and decrease the number of parameters to be trained; and 2) to gracefully handle out-of-vocabulary words, which will occur in unseen data no matter how big a vocabulary the model uses. The word piece models are based on (and were successfully used for) Google’s neural machine translation system, which in turn was inspired by work on Japanese and Korean voice search. The latter approach builds the vocabulary using the following procedure, called Byte Pair Encoding, which was originally developed for data compression.

  1. Initialize the vocabulary with all basic characters of the language (so 52 letters for case-sensitive English and some punctuation, but maybe over 11,000 for Korean);
  2. Build a language model with this vocabulary;
  3. Generate new word pieces by combining pairs of existing pieces. Add to the vocabulary those new pieces that increase the language model’s likelihood on the training data the most, i.e., pieces that frequently occur consecutively in the training data;
  4. Goto Step 2 unless the maximum vocabulary size is reached.
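The loop above can be sketched in a few lines when, as in the frequency-based variant discussed below, “increases the likelihood the most” is approximated by “occurs most often as an adjacent pair”. The corpus and merge budget here are hypothetical toy values:

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn a byte-pair-encoding vocabulary: repeatedly merge the most
    frequent adjacent pair of symbols (pair frequency standing in for the
    likelihood gain of Step 3)."""
    # Step 1: initialize with single characters.
    words = Counter()
    for word, freq in corpus_words.items():
        words[tuple(word)] = freq
    vocab = {c for w in words for c in w}
    for _ in range(num_merges):            # Step 4: stop at the size budget
        pairs = Counter()
        for word, freq in words.items():   # Steps 2-3: count adjacent pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        vocab.add(a + b)
        merged = Counter()
        for word, freq in words.items():   # rewrite the corpus with the merge
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] += freq
        words = merged
    return vocab

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(sorted(learn_bpe(corpus, 4)))
```

With this toy corpus the frequent endings are merged first, so pieces like “est” enter the vocabulary alongside the 10 base characters.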

The approach is described in detail by Rico Sennrich and colleagues (Neural Machine Translation of Rare Words with Subword Units), including the open source implementation subword-nmt. Their solution picks the most frequent pairs in Step 3, which seems to be suboptimal from a language modelling perspective: if the individual pieces occur a lot, the new combined piece might occur frequently by chance. The approach also does not export the frequencies (or probabilities) with the vocabulary. A more principled approach is taken by Taku Kudo (Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates), which comes with the open source implementation sentencepiece. This approach uses a simple unigram model of word pieces, and optimizes its probabilities on the training data using expectation maximization training. The algorithm that finds the optimal vocabulary is less systematic than the byte pair encoding algorithm above, instead starting with a big heuristics-based vocabulary and decreasing its size during the training process.
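Under the unigram model, the score of a segmentation is the product of the piece probabilities, and the best segmentation can be found with dynamic programming. The sketch below illustrates just that decoding step with hand-set, hypothetical probabilities; the EM training that sentencepiece actually performs is omitted:

```python
import math

# Hypothetical unigram piece probabilities (in practice these are estimated
# with expectation maximization over the training data).
PROBS = {"em": 0.05, "bed": 0.04, "bedding": 0.002, "ding": 0.01,
         "s": 0.08, "embed": 0.001, "dings": 0.0005}

def viterbi_segment(word, probs=PROBS):
    """Best segmentation = argmax over products of piece probabilities,
    computed by dynamic programming over end positions (Viterbi)."""
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)  # (log prob, backpointer)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, end = [], n                   # backtrack from the final position
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces))

print(viterbi_segment("embeddings"))  # → ['em', 'bedding', 's']
```

Note that the globally optimal segmentation under a probabilistic model need not coincide with a greedy longest-match split; here “bedding” as one piece beats “bed” plus “ding” because the product of probabilities is higher.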

Of course, word segmentation has always been important for languages that do not use spaces, such as Chinese and Japanese. It has been useful for languages that allow compound nouns too, for instance Dutch and German. However, decreasing the vocabulary size for an English retrieval task seems a counter-intuitive approach, certainly given the amount of work on increasing the vocabulary size by adding phrases. Increased vocabularies for retrieval were evaluated by, for instance, Mandar Mitra et al., Andrew Turpin and Alistair Moffat, and our own Kees Koster and Marc Seuter, but almost always with little success.

As a final thought, I think it is interesting that a successful deep neural model like BERT uses good old statistical NLP for deriving its word piece vocabulary. The word pieces literally form the basis of BERT, but they are not based on a neural network approach themselves. I believe that in the coming years, researchers will start to replace many of the typical neural approaches, such as back propagation and softmax normalization, with more principled approaches like expectation maximization and maximum likelihood estimation. But I have not been right about the future nearly as often as Chris Manning, so don’t take my word for it.

Abhishta defends PhD thesis on the impacts of DDoS attacks

The Blind Man and the Elephant: Measuring Economic Impacts of DDoS Attacks

by Abhishta

The Internet has become an important part of our everyday life. We use services like Netflix, Skype, online banking and Scopus daily. We even use the Internet for filing our taxes and communicating with the municipality. This dependency on network-based technologies also provides malicious actors in our society with an opportunity to remotely attack IT infrastructure. One such cyberattack, which may lead to the unavailability of network resources, is known as a distributed denial of service (DDoS) attack. A DDoS attack leverages many computers to launch a coordinated denial of service attack against one or more targets.
These attacks cause damage to victim businesses. According to reports published by several consultancies and security companies, these attacks lead to millions of dollars in losses every year. One might wonder: is the damage caused by the temporary unavailability of network services really this large? One point of criticism of these reports has been that they often base their findings on victim surveys and expert opinions. As cost accounting and bookkeeping methods are not focused on measuring the impact of cyber security incidents, it is highly likely that surveys are unable to capture the true impact of an attack. A concerning fact is that most C-level managers make budgetary decisions for security based on the losses reported in these surveys. Several inputs for security investment decision models, such as return on security investment (ROSI), also depend on these figures. This makes the situation very similar to the parable of the blind men and the elephant, who try to conceptualise what the elephant looks like by touching it. Hence, it is important to develop methodologies that capture the true impact of DDoS attacks. In this thesis, we study the economic impact of DDoS attacks on public and private organisations using an empirical approach.

[download thesis]

FFORT: A benchmark suite for fault tree analysis

by Enno Ruijters, Carlos Budde, Muhammad Nakhaee, Mariëlle Stoelinga, Doina Bucur, Djoerd Hiemstra, and Stefano Schivo

This paper presents FFORT (the Fault tree FOResT): a large, diverse, extendable, and open benchmark suite consisting of fault tree models, together with relevant metadata. Fault trees are a common formalism in reliability engineering, and the FFORT benchmark brings together a large and representative suite of fault tree models. The benchmark provides each fault tree model in standard Galileo format, together with references to its origin, and a textual and/or graphical description of the tree. This includes quantitative information such as failure rates, and the results of quantitative analyses of standard reliability metrics, such as the system reliability, availability and mean time to failure. Thus, the FFORT benchmark provides: (1) Examples of how fault trees are used in various domains; (2) A large class of tree models to evaluate fault tree methods and tools; (3) Results of analyses to compare newly developed methods with the benchmark results. Currently, the benchmark suite contains 202 fault tree models of great diversity in terms of size, type, and application domain. The benchmark offers statistics on several relevant model features, indicating e.g. how often such features occur in the benchmark, as well as search facilities for fault tree models with the desired features. In addition to the trees already collected, the website provides a user-friendly submission page, allowing the general public to contribute more fault trees and/or analysis results with new methods. Thereby, we aim to provide an open-access, representative collection of fault trees at the state of the art in modeling and analysis.
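To give a flavour of the kind of quantitative analysis whose results the benchmark records, the sketch below computes the top-event (system failure) probability of a small static fault tree with independent basic events. The tree and its probabilities are hypothetical, not taken from FFORT, and the bottom-up formula is only exact when each basic event occurs once in the tree:

```python
# Bottom-up evaluation of a static fault tree with AND/OR gates and
# independent basic events: AND multiplies failure probabilities,
# OR combines them as 1 - prod(1 - p).
from math import prod

def evaluate(node, basic_probs):
    """node is either a basic-event name or a (gate, [children]) pair."""
    if isinstance(node, str):
        return basic_probs[node]
    gate, children = node
    ps = [evaluate(c, basic_probs) for c in children]
    if gate == "AND":
        return prod(ps)
    if gate == "OR":
        return 1 - prod(1 - p for p in ps)
    raise ValueError(f"unknown gate {gate}")

# Toy tree: the system fails if the power supply fails, or both pumps fail.
tree = ("OR", ["power", ("AND", ["pump1", "pump2"])])
probs = {"power": 0.01, "pump1": 0.1, "pump2": 0.1}
print(evaluate(tree, probs))  # ≈ 0.0199
```

Dynamic fault trees, which Galileo also supports, add gates whose semantics depend on event ordering and need more involved analysis techniques than this bottom-up pass.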

Presented at the 29th European Safety and Reliability Conference (ESREL 2019) in Hannover, Germany

[download pdf]

Wim van der Zijden graduates on Multi-Tenant Customizable Databases

by Wim van der Zijden

A good practice in business is to focus on key activities. For some companies this may be branding, while other businesses may focus on areas such as consultancy, production or distribution. Focusing on key activities means outsourcing as many other activities as possible. These other activities merely distract from the main goals of the company, and the company will not be able to excel in them.
Many companies are in need of reliable software to persistently process live data transactions and enable reporting on this data. To fulfil this need, they often have large IT departments in-house. Those departments are costly and distract from the company’s main goals. The emergence of cloud computing should make this no longer necessary. All they need is an internet connection and a service contract with an external provider.
However, most businesses are in need of highly customizable software, because each company has slightly different business processes, even those in the same industry. So even if they outsource their IT needs, they will still have to pay expensive developers and business analysts to customize an existing application.
These issues are addressed by Multi-Tenant Customizable (MTC) applications. We define such an application as follows:

A single software solution that can be used by multiple organizations at the same time and which is highly customizable for each organization and user within that organization, by domain experts without a technical background.

A key challenge in designing such a system is to develop a proper persistent data storage, because mainstream databases are optimized for single-tenant usage. To this end, this Master’s thesis consists of two papers: the first paper proposes an MTC-DB benchmark, MTCB, which allows for objective comparison and evaluation of MTC-DB implementations, as well as providing a framework for the definition of MTC-DB. The second paper describes a number of MTC-DB implementations and uses the benchmark to evaluate them.

[download pdf]

Data Science Platform Netherlands


The Data Science Platform Netherlands (DSPN) is the national platform for ICT research within the Data Science domain. Data Science is the collection and analysis of so-called ‘Big Data’ according to academic methodology. DSPN unites all Dutch academic research institutions where Data Science is carried out from an ICT perspective, specifically the computer science or applied mathematics perspectives. The objectives of DSPN are to:

  • Highlight the importance of ICT research in Big Data and Data Science, especially in national discussions about research and education.
  • Exchange and disseminate information about Data Science research and education.
  • Build and maintain a network of ICT researchers active in the field of Data Science.

DSPN was launched as part of the ICT Research Platform Netherlands (IPN) to give a voice to the Data Science initiatives of the Dutch ICT research organisations. For more information, see the website at: http://www.datascienceplatform.org/.

Luhn Revisited: Significant Words Language Models

by Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra, and Maarten Marx

Users tend to articulate their complex information needs in only a few keywords, making underspecified statements of request the main bottleneck for retrieval effectiveness. Taking advantage of feedback information is one of the best ways to enrich the query representation, but it can also lead to loss of query focus and harm performance, in particular when the initial query retrieves little relevant information and the model overfits to accidental features of the particular feedback documents observed. Inspired by the early work of Hans Peter Luhn, we propose significant words language models of feedback documents that capture all, and only, the significant shared terms from feedback documents. We adjust the weights of common terms that are already well explained by the document collection as well as the weights of rare terms that are only explained by specific feedback documents, which eventually results in having only the significant terms left in the feedback model.
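The discounting idea can be illustrated with a toy computation. The sketch below is not the paper’s estimator (which fits a mixture model with expectation maximization); it merely shows the two discounts at work on hypothetical feedback documents: terms well explained by the collection model and terms concentrated in a single feedback document both drop out:

```python
from collections import Counter

def significant_words(feedback_docs, collection_probs, n_top=5):
    """Toy illustration (not the paper's EM estimator): pool the feedback
    documents, then discount each term by how well the collection model
    already explains it, and by how concentrated it is in one document."""
    per_doc = [Counter(d) for d in feedback_docs]
    pooled = Counter()
    for c in per_doc:
        pooled.update(c)
    total = sum(pooled.values())
    scores = {}
    for term, tf in pooled.items():
        p_fb = tf / total                          # feedback model estimate
        p_coll = collection_probs.get(term, 1e-6)  # general-term evidence
        peak = max(c[term] for c in per_doc) / tf  # doc-specific evidence
        score = p_fb * (1 - min(1.0, p_coll / p_fb)) * (1 - peak ** 2)
        if score > 0:
            scores[term] = score
    return sorted(scores, key=scores.get, reverse=True)[:n_top]

docs = [
    "retrieval model query retrieval the the".split(),
    "retrieval feedback query the of".split(),
    "query retrieval ranking the a".split(),
]
coll = {"the": 0.25, "of": 0.05, "a": 0.05}  # hypothetical collection model
print(significant_words(docs, coll))  # → ['retrieval', 'query']
```

In this toy run, “the” is removed because the collection model explains it, and “ranking” is removed because it occurs in only one feedback document; the significant shared terms remain.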

Establishing a set of 'Significant Words'

Our main contributions are the following. First, we present significant words language models as effective models capturing the essential terms and their probabilities. Second, we apply the resulting models to the relevance feedback task, and show improved performance over state-of-the-art methods. Third, we show that the estimation method is remarkably robust, making the models insensitive to noisy non-relevant terms in feedback documents. Our general observation is that the significant words language models more accurately capture relevance by excluding general terms and feedback-document-specific terms.

To be presented at the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016) on October 24-28, 2016 in Indianapolis, United States.

[download pdf]

3TU NIRICT theme Data Science

The main objective of the NIRICT research in Data Science is to study the science and technology to unlock the intelligence that is hidden inside Big Data.
The amounts of data that information systems work with are rapidly increasing. The data explosion happens at an unprecedented pace, and in our networked world of today the trend is even accelerating. Companies have transactional data with trillions of bytes of information about their customers, suppliers and operations. Sensors in smart devices generate unparalleled amounts of sensor data. Social media sites and mobile phones have allowed billions of individuals globally to create their own enormous trails of data.
The driving force behind this data explosion is the networked world we live in, where information systems, organizations that employ them, people that use them, and processes that they support are connected and integrated, together with the data contained in those systems.

What happens in an internet minute in 2016?

Unlocking the Hidden Intelligence

Data alone is just a commodity; it is Data Science that converts big data into knowledge and insights. Intelligence is hidden in all sorts of data and data systems.
Data in information systems is usually created and generated for specific purposes: it is mostly designed to support operational processes within organizations. However, as a by-product, such event data provide an enormous source of hidden intelligence about what is happening, but organizations can only capitalize on that intelligence if they are able to extract it and transform the intelligence into novel services.
Analyzing the data provides opportunities for organizations to gather intelligence, to capitalize on the historic and current performance of their processes, and to exploit future chances for performance improvement.
Another rich source of information and insights is data from the Social Web. Analyzing Social Web Data provides governments, society and companies with better understanding of their community and knowledge about human behavior and preferences.
Each 3TU institute has its own Data Science program, where local data science expertise is bundled and connected to real-world challenges.

Delft Data Science (DDS) – TU Delft
Scientific director: Prof. Geert-Jan Houben

Data Science Center Eindhoven (DSC/e) – TU/e
Scientific director: Prof. Wil van der Aalst

Data Science Center UTwente (DSC UT) – UT
Scientific director: Dr. Djoerd Hiemstra

More information at: https://www.3tu.nl/nirict/en/Research/data-science/.

Data Science Day

On 20 April, we organize a Data Science Day in the DesignLab. Invited speakers at the Data Science Colloquium are Piet Daas, methodologist and big data research coordinator of the CBS (Centraal Bureau voor de Statistiek), who will talk about big data from Twitter and Facebook as a data source for official statistics; Rolf de By and Raul Zurita Milla, professors of ITC Geo-Information Science and Earth Observation, who will talk about remote sensing techniques, using satellites and drones to help economies in poor areas of the world, in a prestigious project funded by the Bill and Melinda Gates Foundation; and Jan Willem Tulp, creator of interactive data visualisations for magazines like Scientific American and Popular Science, as well as for companies, for instance the Tax Free Retail Analysis Tool for Schiphol Amsterdam Airport.

The Data Science colloquia are kindly sponsored by the CTIT and the Netherlands Research School for Information and Knowledge Systems (SIKS) and part of the SIKS educational program.

[more information]