Language Models are Unsupervised Multitask Learners



Supervised learning systems are brittle and sensitive to changes in distribution and task (“narrow experts”). The prevalence of single-task training on single-domain datasets might be a major contributor to the lack of generalization observed in current systems.

Multitask learning is a promising framework for improving general performance. However, multi-task training in NLP is still nascent.

The trend of pre-trained language representation NLP:

  1. single-layer pre-trained word embedding + task-specific architectures
  2. multiple layers of representations (e.g. RNN) + task-specific architectures
  3. pre-train RNNs or Transformers, and then directly fine-tune, without task-specific architectures (e.g. GPT, BERT).

This work: LM can perform a wide range of downstream tasks in a zero-shot setting, without any parameter or architecture modification.


Language Model

Training LMs in a probabilistic framework as estimating a conditional distribution of the output given the input and the task information (multi-task/meta-learning),

\[p(\texttt{output}|\texttt{input}, \texttt{task})\]

The language model is auto-regressive, i.e., it predicts the next word given the previous words.

Task Conditioning


LMs with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If LMs are able to do this it will be, in effect, performing unsupervised multitask learning.

Test this by analyzing the performance of LMs in a zero-shot setting on a wide variety of tasks.

Training Dataset

Although web scrapes such as Common Crawl are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues.

The authors created a new web scrape that emphasizes document quality.

  1. Only scraped web pages that have been curated/filtered by humans: scraped all outbound links from Reddit which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
  2. Extract the text from HTML responses
  3. Ee-duplication and some heuristic-based cleaning
  4. Removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.

Results in over 8 million documents for a total of 40 GB of text (about 40 Billion Bytes).

Input Representation

A general LM should be able to compute the probability of (and also generate) any string.

While processing Unicode strings as a sequence of UTF-8 bytes elegantly fulfills this requirement, current byte-level LMs are not competitive with word-level LMs on large-scale datasets.

Byte Pair Encoding (BPE)

An interpolation between word-level inputs for frequent symbol sequences and character-level inputs for infrequent symbol sequences.

BPE on Unicode code points: The size of the base vocabulary is too large (> 130,000) compared to the 32,000 to 64,000 token vocabularies often used with BPE.

BPE on Byte Level

  1. A base vocabulary of size 256
  2. Naive BPE results in suboptimal merges due to the greedy strategy. To avoid this, the authors prevent BPE from merging across character categories, with an exception for spaces.
  3. Enable the model to assign a probability to any Unicode string.


The model largely follows the details of the GPT model with a few modifications.

  1. LayerNorm was moved to the input of each sub-block and an additional LayerNorm was added after the final self-attention block.
  2. A modified initialization that accounts for the accumulation on the residual path with model depth is used. Scale the weights of residual layers at initialization by a factor of \(1/\sqrt{N}\), where \(N\) is the number of residual layers.
  3. The vocabulary is expanded to 50,257.
  4. We also increase the context size from 512 to 1024 tokens,
  5. A larger batch size of 512 is used.

The largest model (GPT-2) has 1.5B parameters.


Language Modeling

This is the primary task the models are trained for.

Task: Evaluating the log probability of different datasets according to the LM.

Results: GPT-2 transfers well across domains and datasets, improving the state of the art on 7 out of the 8 datasets in a zero-shot setting.


The LAMBADA dataset tests the ability of systems to model long-range dependencies in text. The

Task: predict the final word of sentences that require at least 50 tokens of context for a human to successfully predict.

Results: GPT-2 improves the SOTA.

Common error: valid continuations of the sentence, but not valid final words.

This suggests that the LM is not using the additional useful constraint that the word must be the final of the sentence. Adding a stop-word filter as an approximation to this further increases accuracy.

Reading Comprehension

The Conversation Question Answering dataset (CoQA) consists of documents from 7 different domains paired with natural language dialogues between a question asker and a question answerer about the document. CoQA tests reading comprehension capabilities and also the ability of models to answer questions that depend on conversation history (such as “Why?”).

Use greedy decoding from GPT-2 conditioned on a document, the history of the associated conversation, and a final token A:.

Results: GPT-2 matches or exceeds the performance of 3 out of 4 baseline systems, and underperforms the supervised SOTA (BERT-based)


Induce summarization behavior by adding the text TL;DR: after the article and generating 100 tokens with Top-\(k\) random sampling with \(k = 2\) which reduces repetition and encourages more abstractive summaries than greedy decoding. The first 3 generated sentences in these 100 tokens are used as the summary.



Help GPT-2 infer the translation task, by conditioning the LM on a context of example pairs of the format english sentence = french sentence followed by a final prompt of english sentence =. After the prompt, outputs are sampled with greedy decoding and the first generated sentence is used as the translation.


Note: Since non-English webpages were filtered from WebText, it only contains a very small (10MB) corpus in the Frech language.

Question Answering

Similar to translation, the context of the language model is seeded with example question answer pairs which helps the model infer the short answer style of the dataset.

Results: The performance of GPT-2 is still much, much, worse than the existing open domain question answering systems.