BERT: Bidirectional Encoder Representations from Transformers

Pre-training of Deep Bidirectional Transformers for Language Understanding



Existing strategies for applying pre-trained representations to downstream tasks:

GPT uses unidirectional language modeling. This limits the choice of architectures that can be used during pre-training. Such restrictions are sub-optimal for sentence-level tasks and could be very harmful when applying fine-tuning based approaches to token-level tasks.

This work:


Two steps:

  1. Pre-training: BERT is trained on unlabeled data over different pre-training tasks.
  2. Fine-Tuning: BERT is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.

Overall pre-training and fine-tuning procedures for BERT.


BERT uses a multi-layer bidirectional Transformer encoder, i.e., a single stack of transformer layers.

The BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to the context to its left.

Input/Output Representation

To handle a variety of downstream tasks, BERT takes a single sentence or a pair of sentences (e.g., question and answer) as the input token sequence.


For a given token, its input representation is constructed by summing the corresponding token, segment (sentence A or B), and position embeddings.

Pre-Training BERT

Task 1: Masked LM

Intuitively, a deep bidirectional model is more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model.

Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”.

Solution: Mask Language Model (MLM), which masks some percentage of the input tokens at random, and then predicts those masked tokens.

  1. BERT “masks” 15% of all tokens in each sequence at random.
    • 80%: replaced by the [MASK] token
    • 10%: replaced by a random token
    • 10%: unchanged
  2. The final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary.
  3. Only predict the masked words rather than reconstruct the entire input.

Task 2: Next Sentence Prediction (NSP)

Motivation: Many downstream tasks are based on understanding the relationship between two sentences, which is not directly captured by LM.

  1. Choosing the sentences A and B for each pre-training example, s.t.
    • 50% of the time B is the actual next sentence that follows A (labeled as IsNext),
    • 50% of the time it is a random sentence from the corpus (labeled as NotNext).
  2. The final hidden vector of the [CLS] token is used for next sentence prediction.

The NSP task is beneficial to both QA and NLI.

Pre-Training Data

It is critical to use a document-level corpus rather than a shuffled sentence-level corpus in order to extract long contiguous sequences.

Fine-Tuning BERT

For each task, simply plug in the tasks-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.

Input sentence pairs




The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks.

Results: Both BERT-base and BERT-large outperform all systems on all tasks by a substantial margin.



The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs. Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

Represent the input question and passage as a single packed sequence (A: question, B: passage).

Only introduce a start vector \(S\in\mathbb{R}^H\) and an end vector \(E\in\mathbb{R}^H\) (trainable weight vectors).

The probability of word \(i\) being the start (or end) of the answer span is computed as a dot product between the output representation \(T_i\) and \(S\) (or \(E\)) followed by a softmax over all of the words in the paragraph:

\[P_i^S = \frac{e^{S\cdot T_i}}{\sum_j e^{S\cdot T_j}} \qquad\text{ or }\qquad P_i^E\frac{e^{E\cdot T_i}}{\sum_j e^{E\cdot T_j}}.\]

Loss: the sum of the log-likelihoods of the correct start and end positions.

Prediction: Given a span \((i,j)\), compute the score as \(S\cdot T_i + E\cdot T_j\), and use the maximum scoring span where \(j\ge i\) as a prediction.


The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.

Model modification: treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. For prediction, the score of the no-answer span: \(s_\texttt{null}\) is compared to the best non-null span \(\hat{s}_{i,j}\). A non-null answer is chosen when \(\hat{s}_{i,j}>s_\texttt{null}+\tau\), where \(\tau\) is a threshold. \(s_\texttt{null}=S\cdot C+ E\cdot C\), where \(C\) is the output representation of the [CLS] token.

Results: BERT outperforms the previous best systems on both SQuAD v1.1 and v2.0.


The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference. Given a sentence, the task is to choose the most plausible continuation among four choices.

Input: construct four input sequences, each is the concatenation of the given sentence A and a possible continuation B.

Output: introduce a vector whose dot product with the [CLS] token representation \(C\) denotes a score for each choice.

Results: BERT-large outperforms the ELMo and GPT.

Ablation Studies

Effect of Pre-training Tasks

Different pre-training tasks:


Effect of Model Size

Feature-based Approach with BERT

Advantages of the feature-based approaches:

To ablate the fine-tuning approach, the authors apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer BiLSTM before the classification layer.