BERT: Bidirectional Encoder Representations from Transformers

Pre-training of Deep Bidirectional Transformers for Language Understanding

Published

October 11, 2018

Paper

arXiv

Code

Github

Takeaways

BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
The pre-trained BERT model can be fine-tuned with just one additional output layer to create SOTA models for a wide range of tasks.

Introduction

Existing strategies for applying pre-trained representations to downstream tasks:

feature-based: use task-specific architectures that include the pre-trained representations as additional features (e.g., ELMo).
fine-tuning: introduce minimal task-specific parameters, and fine-tune all pre-trained parameters on downstream tasks (e.g., GPT).

GPT uses unidirectional language modeling. This limits the choice of architectures that can be used during pre-training. Such restrictions are sub-optimal for sentence-level tasks and could be very harmful when applying fine-tuning based approaches to token-level tasks.

This work:

improves the fine-tuning based approaches,
replaces the left-to-right language model by the masked language model (bidirectional),
use a “next sentence prediction” task.

Methods

Two steps:

Pre-training: BERT is trained on unlabeled data over different pre-training tasks.
Fine-Tuning: BERT is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.

Overall pre-training and fine-tuning procedures for BERT.

Model

BERT uses a multi-layer bidirectional Transformer encoder, i.e., a single stack of transformer layers.

The BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to the context to its left.

Input/Output Representation

To handle a variety of downstream tasks, BERT takes a single sentence or a pair of sentences (e.g., question and answer) as the input token sequence.

Token:

BERT uses WordPiece embeddings with a 30,000 token vocabulary.
The first token of every sequence is a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
Sentence pairs are separated with a special token ([SEP]). A learned embedding is added to every token indicating whether it belongs to sentence A or sentence B.

For a given token, its input representation is constructed by summing the corresponding token, segment (sentence A or B), and position embeddings.

Pre-Training BERT

Task 1: Masked LM

Intuitively, a deep bidirectional model is more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model.

Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”.

Solution: Mask Language Model (MLM), which masks some percentage of the input tokens at random, and then predicts those masked tokens.

BERT “masks” 15% of all tokens in each sequence at random.
- 80%: replaced by the [MASK] token
- 10%: replaced by a random token
- 10%: unchanged
The final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary.
Only predict the masked words rather than reconstruct the entire input.

Task 2: Next Sentence Prediction (NSP)

Motivation: Many downstream tasks are based on understanding the relationship between two sentences, which is not directly captured by LM.

Choosing the sentences A and B for each pre-training example, s.t.
- 50% of the time B is the actual next sentence that follows A (labeled as IsNext),
- 50% of the time it is a random sentence from the corpus (labeled as NotNext).
The final hidden vector of the [CLS] token is used for next sentence prediction.

The NSP task is beneficial to both QA and NLI.

Pre-Training Data

BooksCorpus (800M words)
English Wikipedia (2,500M words)

It is critical to use a document-level corpus rather than a shuffled sentence-level corpus in order to extract long contiguous sequences.

Fine-Tuning BERT

For each task, simply plug in the tasks-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.

Input sentence pairs

Paraphrasing: sentence pairs
Entailment: hypothesis-premise pairs
Question answering: question-passage pairs
Text classification or sequence tagging: degenerate text-\(\varnothing\) pairs

Output

Token level tasks (e.g. sequence tagging or question answering): the token representations are fed into an output layer.
Text classification (e.g., entailment or sentiment analysis): the [CLS] representation is fed into an output layer.

Experiments

GLUE

The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks.

Results: Both BERT-base and BERT-large outperform all systems on all tasks by a substantial margin.

SQuAD

v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs. Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

Represent the input question and passage as a single packed sequence (A: question, B: passage).

Only introduce a start vector \(S\in\mathbb{R}^H\) and an end vector \(E\in\mathbb{R}^H\) (trainable weight vectors).

The probability of word \(i\) being the start (or end) of the answer span is computed as a dot product between the output representation \(T_i\) and \(S\) (or \(E\)) followed by a softmax over all of the words in the paragraph:

\[P_i^S = \frac{e^{S\cdot T_i}}{\sum_j e^{S\cdot T_j}} \qquad\text{ or }\qquad P_i^E\frac{e^{E\cdot T_i}}{\sum_j e^{E\cdot T_j}}.\]

Loss: the sum of the log-likelihoods of the correct start and end positions.

Prediction: Given a span \((i,j)\), compute the score as \(S\cdot T_i + E\cdot T_j\), and use the maximum scoring span where \(j\ge i\) as a prediction.

v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.

Model modification: treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. For prediction, the score of the no-answer span: \(s_\texttt{null}\) is compared to the best non-null span \(\hat{s}_{i,j}\). A non-null answer is chosen when \(\hat{s}_{i,j}>s_\texttt{null}+\tau\), where \(\tau\) is a threshold. \(s_\texttt{null}=S\cdot C+ E\cdot C\), where \(C\) is the output representation of the [CLS] token.

Results: BERT outperforms the previous best systems on both SQuAD v1.1 and v2.0.

SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference. Given a sentence, the task is to choose the most plausible continuation among four choices.

Input: construct four input sequences, each is the concatenation of the given sentence A and a possible continuation B.

Output: introduce a vector whose dot product with the [CLS] token representation \(C\) denotes a score for each choice.

Results: BERT-large outperforms the ELMo and GPT.

Ablation Studies

Effect of Pre-training Tasks

Different pre-training tasks:

BERT
No NSP: BERT without the “next sentence prediction” task
LTR & No NSP: A standard Left-to-Right (LTR) LM without NSP, similar to GPT

Results:

Removing NSP hurts the performance of BERT.
The LTR model performs worse than the MLM model on all tasks

Effect of Model Size

Larger models lead to a strict accuracy improvement across different datasets.
This work demonstrates that scaling to extreme model sizes also leads to large improvements on very small-scale tasks, provided that the model has been sufficiently pre-trained.
Previous work shows evidence that larger models might not help the feature-based approaches.

Feature-based Approach with BERT

Advantages of the feature-based approaches:

Not all tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added.
There are major computational benefits to pre-compute an expensive representation of the training data once and then run many experiments with cheaper models on top of this representation.

To ablate the fine-tuning approach, the authors apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer BiLSTM before the classification layer.

Results:

Fine-tuning outperforms feature-based approaches.
The best-performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which achieves performance similar to fine-tuning.
BERT is effective for both finetuning and feature-based approaches.