Language Models are Few-Shot Learners



The trend of pre-trained language representations in NLP:

  1. single-layer pre-trained word embedding + task-specific architectures
  2. multiple layers of representations (e.g. RNN) + task-specific architectures
  3. pre-train RNNs or Transformers, and then directly fine-tune them, without task-specific architectures.

Problems with Pre-training + Fine-tuning

  1. Every new task requires a large dataset of labeled examples.
  2. Fine-tuned models can exploit spurious correlations in the training data, hurting out-of-distribution generalization.
  3. Humans do not need large supervised datasets to learn most language tasks.

Meta Learning

In the context of LMs, Meta learning means the model develops a broad set of skills at training time and then uses those abilities at inference time to rapidly adapt to or recognize the desired task.

GPT-2 attempts to do this via what is called “in-context learning”: the model is conditioned on a natural language instruction and/or a few demonstrations of the task.

While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning.

This work shows that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches.

Model Scale

Model            GPT    BERT   GPT-2   Megatron-LM   T5     Turing-NLG
# of parameters  100M   300M   1.5B    8B            11B    17B

There is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.
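As a toy illustration of such a smooth scaling trend, validation loss is often modeled as a power law in parameter count; the functional form and all constants below are illustrative, not the paper's measured values:

```python
def power_law_loss(n_params, a=1.0e10, alpha=0.076, floor=1.7):
    """Illustrative power-law loss curve: loss falls smoothly as the
    parameter count grows (all constants here are made up)."""
    return floor + (a / n_params) ** alpha

for n in (125e6, 1.3e9, 13e9, 175e9):
    print(f"{n:.0e} params -> loss {power_law_loss(n):.3f}")
```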

The authors test this hypothesis by training a 175B parameter autoregressive language model (GPT-3) and measuring its in-context learning abilities (few-shot, one-shot, and zero-shot).


Different Settings

Fine-Tuning (FT): update the weights of the pre-trained model by training on a supervised dataset specific to the desired task (not used for GPT-3 in this work).

Few-Shot (FS): give the model \(K\) demonstrations of the task as conditioning at inference time; no weight updates are allowed.

One-Shot (1S): same as few-shot, but with only one demonstration.

Zero-Shot (0S): no demonstrations; only a natural language instruction describing the task.


Model and Architecture

The same model and architecture as GPT-2 are used, with two differences:

  1. GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the transformer.
  2. To study the dependence of ML performance on model size, 8 different sizes of model were trained, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model called GPT-3.
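A minimal sketch of the two attention mask types being alternated across layers, assuming a causal setting and an illustrative band width (the actual Sparse Transformer patterns are more elaborate):

```python
def dense_mask(n):
    """Causal dense attention: position i attends to every j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def banded_mask(n, band=3):
    """Locally banded causal attention: position i attends only to the
    last `band` positions (band width here is illustrative)."""
    return [[i - band < j <= i for j in range(n)] for i in range(n)]

# Alternate dense and banded patterns layer by layer, as in GPT-3.
layers = [dense_mask(8) if l % 2 == 0 else banded_mask(8) for l in range(4)]
```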

Training Dataset

Unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, the authors took 3 steps to improve the average quality of the datasets:

  1. Downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora;
  2. Performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of the held-out validation set as an accurate measure of overfitting;
  3. Added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

The overall training dataset has about 500B tokens.
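Step 2, fuzzy document-level deduplication, can be sketched with character-shingle Jaccard similarity; at the paper's scale this is done with MinHash-based LSH rather than the quadratic comparison below, and the threshold here is illustrative:

```python
def shingles(text, k=5):
    """Character k-gram set used for fuzzy matching."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def fuzzy_dedup(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one kept earlier."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```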



For few-shot learning, the authors evaluate each example in the evaluation set by randomly drawing \(K\) examples from that task’s training set as conditioning examples.
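A sketch of that evaluation setup; the demonstration formatting and separators below are assumptions, not the paper's exact prompt format:

```python
import random

def build_few_shot_prompt(train_set, eval_context, k, seed=0):
    """Draw K random demonstrations and prepend them to the eval context.

    `train_set` items are (context, completion) pairs; the blank-line
    separators are an assumption for illustration.
    """
    rng = random.Random(seed)
    demos = rng.sample(train_set, k)
    lines = [f"{ctx} {ans}" for ctx, ans in demos]
    lines.append(eval_context)  # context only; the model must complete it
    return "\n\n".join(lines)

train = [("2 + 2 =", "4"), ("3 + 5 =", "8"), ("7 - 2 =", "5")]
prompt = build_few_shot_prompt(train, "6 + 1 =", k=2)
```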

Multiple-Choice Problems

Provide \(K\) examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion.

For most tasks, the per-token likelihood is compared (to normalize for length). However, for some datasets it is beneficial to normalize by the unconditional probability of each completion, computing

\[\frac{P(\texttt{completion}|\texttt{context})}{P(\texttt{completion}|\texttt{answer context})},\]

where answer context is the string "Answer: " or "A: ".
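Assuming toy per-token log-probabilities for each candidate completion (the numbers below are made up), the two scoring rules look like:

```python
def score_per_token(token_logprobs):
    """Length-normalized likelihood: mean per-token log-probability."""
    return sum(token_logprobs) / len(token_logprobs)

def score_unconditional_norm(logp_given_context, logp_given_answer_context):
    """log P(completion | context) - log P(completion | "Answer: ")."""
    return logp_given_context - logp_given_answer_context

# Toy per-token log-probs for two candidate completions (illustrative):
cand_a = [-0.2, -0.1, -0.3]          # shorter, more likely completion
cand_b = [-0.1, -0.1, -0.1, -0.9]    # longer completion
best = max([cand_a, cand_b], key=score_per_token)
```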

Binary Classification

Give the options more semantically meaningful names (e.g. "True" or "False" rather than 0 or 1) and then treat the task like multiple choice.

Free-Form Completion

Use beam search with a beam width of 4 and a length penalty of \(\alpha = 0.6\).
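A minimal sketch of length-penalized beam search over a toy next-token model; the GNMT-style form of the length penalty is an assumption about the exact formulation:

```python
import math

def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty; the paper reports alpha = 0.6."""
    return ((5 + length) / 6) ** alpha

def beam_search(next_logprobs, beam_width=4, max_len=3, alpha=0.6):
    """Tiny beam search. `next_logprobs(seq)` returns {token: logprob}."""
    beams = [((), 0.0)]  # (sequence, total log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for tok, tok_lp in next_logprobs(seq).items():
                candidates.append((seq + (tok,), lp + tok_lp))
        # Rank by length-normalized score, keep the top beam_width.
        candidates.sort(key=lambda c: c[1] / length_penalty(len(c[0]), alpha),
                        reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

# Toy model: always slightly prefers token "a" over "b".
toy = lambda seq: {"a": math.log(0.6), "b": math.log(0.4)}
```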

Language Modeling, Cloze, and Completion

Language Modeling

Evaluate zero-shot GPT-3 by computing perplexity on the Penn Treebank dataset. GPT-3 sets a new SOTA, improving over GPT-2.
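Perplexity is the exponentiated average negative log-likelihood per token; a minimal sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_logprobs):
    """exp(-mean log p(token)); lower is better."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy per-token natural-log probabilities (illustrative numbers):
logps = [math.log(0.25)] * 4
print(round(perplexity(logps), 6))  # -> 4.0 (uniform over 4 choices)
```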


LAMBADA

Task: The model is asked to predict the last word of sentences, which requires reading a paragraph of context.

The authors use a fill-in-the-blank format to guide GPT-3 to predict a word rather than other valid continuations of the paragraph:

Alice was friends with Bob. Alice went to visit her friend ___. → Bob
George bought some baseball equipment, a ball, a glove, and a ___. →



HellaSwag

Task: Pick the best ending to a story or set of instructions.

Results: The performance of GPT-3 on this task is a fair amount lower than the overall SOTA.


StoryCloze

Task: Select the correct ending sentence for a five-sentence-long story.

Results: GPT-3 is better than previous zero-shot results but still underperforms fine-tuned SOTA.

Question Answering

open-book QA: use an information retrieval system to find relevant text and train a model to generate an answer given the question and the retrieved text.

closed-book QA: train a model to answer the questions directly.



Translation

For GPT-2, a filter was applied to a multilingual collection of documents to produce an English-only dataset, due to capacity concerns. Since capacity increases by over two orders of magnitude from GPT-2 to GPT-3, the scope of the training dataset is expanded to include more representation of other languages. The majority of the data is derived from raw Common Crawl with only quality-based filtering. Although GPT-3’s training data is still primarily English (93% by word count), it also includes 7% text in other languages.

Zero-shot, one-shot, and few-shot GPT-3 respectively underperform, approach, and match the average performance of prior unsupervised NMT work.

GPT-3 has a noticeable skew in its performance depending on language direction. GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction.

Winograd-Style Tasks

Task: Determine which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human.

Common Sense Reasoning

Task: Capture physical or scientific reasoning

Results: Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks.

Reading Comprehension



Results: The average performance of few-shot GPT-3 matches that of a fine-tuned BERT model.

Natural Language Inference

NLI concerns the ability to understand the relationship between two sentences.

Task: a two- or three-class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first, or is neutral (possibly true).

Results: NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.

Synthetic and Qualitative Tasks


Arithmetic

Task: Solve simple arithmetic problems stated in natural language (e.g. two-digit addition).

Results: Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even zero-shot settings.

Word Scrambling and Manipulation Tasks

Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word.
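The distortions can be generated along these lines; function names and exact parameters are illustrative, not the paper's task definitions:

```python
import random

def cycle_letters(word, shift=2):
    """Rotate the word's letters by a fixed offset."""
    return word[shift:] + word[:shift]

def anagram_inner(word, rng):
    """Shuffle all letters except the first and last."""
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]

def random_insertion(word, rng):
    """Insert random punctuation/space characters between letters."""
    out = []
    for ch in word:
        out.append(ch)
        if rng.random() < 0.5:
            out.append(rng.choice(" .,!"))
    return "".join(out)

def reversed_word(word):
    """Spell the word backwards."""
    return word[::-1]
```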


News Article Generation

Few-shot learning: Provide three previous news articles and the title and subtitle of a proposed next article in the model’s context to condition it.


Learning and Using Novel Words

Task: using a word in a sentence after seeing it defined only once.

Results: Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.

Correcting English Grammar

Prompt format:

  Poor English Input: <sentence>
  Good English Output: <sentence>