T5: Text-to-Text Transfer Transformer

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer



Motivation: There is a need for a more rigorous understanding of how the individual components of transfer learning for NLP (large-scale pre-trained models) contribute to performance, e.g., different model architectures, pre-training objectives, datasets, and fine-tuning methods.

The basic idea: Introduce a unified framework (T5) that converts all text-based language problems into a text-to-text format. The text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task considered.

This work is primarily a survey, exploration, and empirical comparison of existing techniques, and it explores the limits of current approaches by scaling up the resulting insights (training models with up to 11B parameters on a dataset of up to 750 GB).



T5 closely follows the original Transformer.

Main differences:

T5 uses an encoder-decoder architecture as in the original Transformer. In comparison, GPT, GPT-2, and BERT use a single stack of Transformer layers.


The Colossal Clean Crawled Corpus (C4), ~750 GB, is constructed as follows:

  1. Start with Common Crawl.
  2. Retain only lines that end in a terminal punctuation mark.
  3. Discard any page with fewer than 5 sentences, and retain only lines that contain at least 3 words.
  4. Remove any page containing any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”.
  5. Remove any line containing the word “JavaScript”.
  6. Remove any page where the placeholder phrase “lorem ipsum” appears.
  7. Remove any page containing a curly bracket, to avoid pages with code.
  8. Discard all but one of any three-sentence span occurring more than once in the dataset.
  9. Filter out non-English pages.
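The line- and page-level heuristics above can be sketched as simple predicate functions. This is a minimal illustration, not the T5 codebase: the helper names and the placeholder blocklist are assumptions, and rule 8's deduplication and rule 9's language filtering are omitted.

```python
# Sketch of C4-style cleaning filters. Thresholds follow the list above;
# helper names and BAD_WORDS contents are illustrative placeholders.
TERMINAL_PUNCT = ('.', '!', '?', '"')
BAD_WORDS = {"badword1", "badword2"}   # stand-in for the public blocklist

def keep_line(line: str) -> bool:
    line = line.strip()
    if not line.endswith(TERMINAL_PUNCT):   # rule 2: terminal punctuation
        return False
    if len(line.split()) < 3:               # rule 3: at least 3 words
        return False
    if "javascript" in line.lower():        # rule 5
        return False
    return True

def keep_page(text: str) -> bool:
    kept = [ln for ln in text.splitlines() if keep_line(ln)]
    if len(kept) < 5:                       # rule 3: at least 5 sentences
        return False                        # (approximated here by lines)
    lower = text.lower()
    if any(word in lower for word in BAD_WORDS):  # rule 4
        return False
    if "lorem ipsum" in lower:              # rule 6
        return False
    if "{" in text or "}" in text:          # rule 7: likely code
        return False
    return True
```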

Input and Output Format

Cast all of the tasks considered into a “text-to-text” format, i.e., a task where the model is fed some text for context or conditioning and is then asked to produce some output text.

The text-to-text framework provides a consistent training objective both for pre-training and fine-tuning.

T5 is trained with a maximum likelihood objective (using “teacher forcing”, i.e., using ground truth as input, instead of model output from a prior time step as an input) and a cross-entropy loss regardless of the task. To specify which task the model should perform, a task-specific (text) prefix is added to the original input sequence before feeding it to the model.
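A few concrete (input, target) pairs in this format; the task prefixes and examples below follow those shown in the paper:

```python
# Every task is cast as "task prefix: input text" -> "output text".
examples = [
    # Translation: both sides are plain text.
    ("translate English to German: That is good.", "Das ist gut."),
    # Classification (CoLA): the class label itself is emitted as text.
    ("cola sentence: The course is jumping well.", "not acceptable"),
    # Regression (STS-B): the similarity score is rendered as a string.
    ("stsb sentence1: The rhino grazed. sentence2: A rhino is grazing.",
     "3.8"),
]
for model_input, target in examples:
    print(model_input, "->", target)
```

Because the label or score is produced as literal text, the same decoder and cross-entropy loss serve classification and regression alike.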

Compare this to GPT-2, which also uses prompts to specify tasks.



Baseline Model

A standard encoder-decoder Transformer is designed so that the encoder and decoder are each similar in size and configuration to a BERT-base model.


Use SentencePiece to encode text as WordPiece tokens (with a vocabulary of 32,000 wordpieces).

Trained the SentencePiece model on a mixture of 10 parts of English C4 data with 1 part each of data classified as German, French or Romanian. This vocabulary was shared across both the input and output of the model. Note that the vocabulary makes it so that the model can only process a predetermined, fixed set of languages.

Unsupervised Objective

Use a “denoising” objective, i.e., masked language modeling. The model is trained to predict missing or otherwise corrupted tokens in the input.

Design an objective that randomly samples and then drops out 15% of tokens in the input sequence. All consecutive spans of dropped-out tokens are replaced by a single sentinel token. Each sentinel token is assigned a token ID that is unique to the sequence.

The target then corresponds to all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence plus a final sentinel token to mark the end of the target sequence. An example is as follows.

Original text:

Thank you for inviting me to your party last week


Input (corrupted):

Thank you <X> me to your party <Y> week


Target:

<X> for inviting <Y> last <Z>
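The span-corruption procedure above can be sketched deterministically. This is a minimal illustration: the drop positions are given explicitly rather than sampled at a 15% rate, and sentinels are spelled <X>, <Y>, <Z> as in the example, whereas real T5 uses dedicated sentinel token IDs.

```python
# Collapse each consecutive run of dropped tokens into one sentinel in the
# input; the target lists the dropped spans between the same sentinels,
# plus a final sentinel marking the end of the target sequence.
def corrupt_spans(tokens, drop_positions, sentinels=("<X>", "<Y>", "<Z>")):
    drop = set(drop_positions)
    inputs, targets = [], []
    s = 0                 # index of the next unused sentinel
    in_span = False
    for i, tok in enumerate(tokens):
        if i in drop:
            if not in_span:           # start of a new dropped span
                inputs.append(sentinels[s])
                targets.append(sentinels[s])
                s += 1
                in_span = True
            targets.append(tok)
        else:
            inputs.append(tok)
            in_span = False
    targets.append(sentinels[s])      # final end-of-target sentinel
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = corrupt_spans(tokens, drop_positions=[2, 3, 8])
print(inp)   # Thank you <X> me to your party <Y> week
print(tgt)   # <X> for inviting <Y> last <Z>
```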


Review and compare the following architectural variants.

Different schematics of the Transformer architecture variants.

Different attention mask patterns.

Model Structures

A major distinguishing factor for different architectures is the “mask” used by different attention mechanisms in the model.

Architecture               | Mask                                     | # of layer stacks
Encoder-decoder (e.g., T5) | Encoder: fully-visible; Decoder: causal  | 2
Language model (e.g., GPT) | Causal                                   | 1
Prefix LM                  | Causal with prefix                       | 1

A fundamental and frequently cited drawback of using an LM in the text-to-text setting is that causal masking forces the model’s representation of the \(i\)-th entry of the input sequence to only depend on the entries up until \(i\). This issue can be avoided in a Transformer-based language model simply by changing the masking pattern (Prefix LM).
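The three mask patterns can be illustrated as 0/1 matrices (row \(i\) may attend to column \(j\) where the entry is 1). A minimal sketch, with `p` denoting the assumed prefix length:

```python
# Fully-visible: every position attends to every other (encoder-style).
def fully_visible(n):
    return [[1] * n for _ in range(n)]

# Causal: position i attends only to positions j <= i (LM-style).
def causal(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Prefix LM: fully visible within the first p positions, causal afterwards.
def prefix_lm(n, p):
    return [[1 if (j < p or j <= i) else 0 for j in range(n)]
            for i in range(n)]
```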

The main difference between a prefix LM and the BERT architecture is that the classifier is simply integrated into the output layer of the Transformer decoder in the prefix LM.


Considered both the standard language modeling objective and the denoising objective discussed in the previous section.

Language modeling objective:

For models that ingest a prefix before making predictions (the encoder-decoder model and prefix LM), we sample a span of text from our unlabeled data set and choose a random point to split it into prefix and target portions.

For the standard language model, we train the model to predict the entire span from beginning to end.

Denoising objective:

The unsupervised denoising objective is designed for text-to-text models; to adapt it for use with a language model, the inputs and targets are concatenated.
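A minimal sketch of this concatenation, reusing the corrupted-span example from the previous section:

```python
# To train a decoder-only LM on the denoising objective, the corrupted
# inputs and the targets are joined into a single sequence; the loss is
# (conceptually) taken on the target portion. In the prefix-LM variant,
# `inputs` serves as the fully visible prefix.
inputs = "Thank you <X> me to your party <Y> week"
targets = "<X> for inviting <Y> last <Z>"
lm_sequence = inputs + " " + targets
print(lm_sequence)
```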


Unsupervised Objectives

Explore different unsupervised objectives. Overall, all of the objectives ingest a sequence of token IDs corresponding to a tokenized span of text from our unlabeled text data set. The token sequence is processed to produce a (corrupted) input sequence and a corresponding target. Then, the model is trained as usual with maximum likelihood to predict the target sequence.

Choices of Objectives

Objective                   | Example input                                            | Example target
Prefix LM                   | Thank you for inviting                                   | me to your party last week .
BERT-style                  | Thank you <M> <M> me to your party apple week .          | (original text)
Deshuffling                 | party me for your to . last fun you inviting week Thank  | (original text)
MASS-style                  | Thank you <M> <M> me to your party <M> week .            | (original text)
I.i.d. noise, replace spans | Thank you <X> me to your party <Y> week .                | <X> for inviting <Y> last <Z>
I.i.d. noise, drop tokens   | Thank you me to your party week .                        | for inviting last
Random spans                | Thank you <X> to <Y> week .                              | <X> for inviting me <Y> your party last <Z>


Pre-Training Dataset

Training Strategy

Fine-Tuning Methods

The standard method is to fine-tune all parameters in the model.

Two alternative methods: adapter layers and gradual unfreezing.

The standard method performs best.

Multi-Task Learning

Train the model on multiple tasks simultaneously (the unsupervised task and downstream supervised tasks). For the unified text-to-text framework, “multi-task learning” simply corresponds to mixing data sets together.
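Mixing data sets requires choosing how often to sample from each task. A sketch of the examples-proportional mixing rates studied in the paper, with an artificial data-set size limit K and optional temperature scaling (rates raised to 1/T and renormalized); the function name and default values here are illustrative:

```python
# Examples-proportional mixing: each task's sampling rate is proportional
# to min(num_examples, K), so one huge data set (e.g. the unsupervised
# task) cannot dominate the mixture. With T > 1 the rates flatten toward
# equal mixing across tasks.
def mixing_rates(sizes, K=2**19, T=1.0):
    clipped = [min(e, K) for e in sizes]
    scaled = [c ** (1.0 / T) for c in clipped]
    total = sum(scaled)
    return [s / total for s in scaled]

# Three tasks with 1k, 9k, and 5M examples; the 5M-example task is
# clipped to K = 10,000 before normalizing.
rates = mixing_rates([1_000, 9_000, 5_000_000], K=10_000)
print(rates)   # [0.05, 0.45, 0.5]
```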

In general, multi-task training underperforms pre-training followed by fine-tuning on most tasks.

Combining Multi-Task Learning with Fine-Tuning

The model is pre-trained on all tasks at once but is then fine-tuned on the individual supervised tasks.

Fine-tuning after multi-task pre-training results in comparable performance to the baseline (unsupervised pre-training + supervised fine-tuning). This suggests that using fine-tuning after multi-task learning can help mitigate some of the trade-offs between different mixing rates.


Compared various strategies for taking advantage of additional compute, including training the model on more data, training a larger model, and using an ensemble of models. Each approach conferred a significant boost in performance.

Putting It All Together

The final T5 model is as follows.