GPT-3

Language Models are Few-Shot Learners

Published

July 22, 2020

Paper

arXiv

Takeaways

Although recent large pre-training models are task-agnostic in architecture, they still require task-specific fine-tuning datasets of thousands of examples.
The authors train GPT-3, an autoregressive LM with 175B parameters, and test its performance in the few-shot setting, i.e., providing task descriptions and a few demonstrations purely via text interaction (the prompt), without any gradient updates.
This work shows that scaling up LMs greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior SOTA fine-tuning approaches.

Introduction

The trend of pre-trained language representations in NLP

single-layer pre-trained word embedding + task-specific architectures
multiple layers of representations (e.g. RNN) + task-specific architectures
pre-train RNNs or Transformers, and then fine-tune them directly, without task-specific architectures.

Problems with Pre-training + Fine-tune

Although the last paradigm uses task-agnostic architectures, it still requires task-specific fine-tuning.
The need for a large dataset of labeled examples for every new task limits the applicability of LMs.
The potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution.
Humans do not require large supervised datasets to learn most language tasks. A brief directive in natural language + a tiny number of examples is often sufficient.

Meta Learning

In the context of LMs, Meta learning means the model develops a broad set of skills at training time and then uses those abilities at inference time to rapidly adapt to or recognize the desired task.

GPT-2 attempts to do this via what is called “in-context learning”: the model is conditioned on a natural language instruction and/or a few demonstrations of the task.

While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning.

This work shows that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches.

Model Scale

Model	GPT	BERT	GPT-2	Megatron-LM	T5	Turing-NLG
# of parameters	100M	300M	1.5B	8B	11B	17B

There is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.

The authors test this hypothesis by training a 175B parameter autoregressive language model (GPT-3) and measuring its in-context learning abilities (few-shot, one-shot, and zero-shot).

Methods

Different Settings

Fine-Tuning (FT)

Updates the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically, thousands to hundreds of thousands of labeled examples are used.
The main advantage is strong performance on many benchmarks.
The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution, and the potential to exploit spurious features of the training data, potentially resulting in an unfair comparison with human performance.
In this work, the authors do not fine-tune GPT-3.

Few-Shot (FS)

Refers to the setting where the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed.
The number of demonstrations \(K\) is typically in the range of 10 to 100, as this is how many examples fit in the model’s context window (\(n_{ctx} = 2048\)).
The main advantages: A major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset.
The main disadvantage: Results from this method have so far been much worse than SOTA fine-tuned models. Also, a small amount of task-specific data is still required.
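
The upper bound on \(K\) comes down to token budgeting. A minimal sketch, assuming the token count of each demonstration and of the query is already known; the function name and the in-order packing are illustrative assumptions, not the authors' procedure:

```python
from typing import Sequence


def max_demonstrations(
    demo_lengths: Sequence[int], query_length: int, n_ctx: int = 2048
) -> int:
    """Count how many demonstrations fit, in order, before the context is full."""
    budget = n_ctx - query_length  # tokens left once the query is placed
    count = 0
    for length in demo_lengths:
        if length > budget:
            break  # the next demonstration would overflow the window
        budget -= length
        count += 1
    return count
```

With 20-token demonstrations and a 48-token query, 100 demonstrations fit in a 2048-token window; with 200-token demonstrations, only 10 do.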

One-Shot (1S)

It is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task.
The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans.

Zero-Shot (0S)

No demonstrations are allowed, and the model is only given a natural language instruction describing the task.
This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations.
But it is also the most challenging setting.
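
The three in-context settings differ only in how many demonstrations precede the query. A minimal sketch of a prompt builder, assuming a `context => completion` line format; the `=>` separator and the function name are illustrative choices, not a fixed format from this summary:

```python
from typing import Sequence, Tuple


def build_prompt(
    description: str, demos: Sequence[Tuple[str, str]], query: str, k: int
) -> str:
    """Assemble a prompt with k demonstrations: k=0 zero-shot, k=1 one-shot, k>1 few-shot."""
    if k < 0 or k > len(demos):
        raise ValueError("k must be between 0 and the number of available demonstrations")
    lines = [description] if description else []
    for context, completion in demos[:k]:
        lines.append(f"{context} => {completion}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)
```

No weights change in any of the three settings; only the conditioning text does.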

Model

GPT-3 uses the same model and architecture as GPT-2, with one exception:

GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the transformer.
To study the dependence of ML performance on model size, 8 different sizes of model were trained, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model called GPT-3.

Training Dataset

Unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, the authors took 3 steps to improve the average quality of the datasets:

Downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora;
Performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of the held-out validation set as an accurate measure of overfitting;
Added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.
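
To make the deduplication step concrete, here is a simplified stand-in that compares documents by exact Jaccard similarity over word 3-gram shingles. This is a sketch under stated assumptions: the shingle size, the 0.8 threshold, and the pairwise comparison are illustrative, and a corpus at this scale would need an approximate method rather than an all-pairs loop.

```python
from typing import List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a document (lowercased)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```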

The overall training dataset has about 500B tokens.

Experiments

Evaluation

For few-shot learning, the authors evaluate each example in the evaluation set by randomly drawing \(K\) examples from that task’s training set as conditioning.
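
The random draw can be sketched as below. Sampling without replacement and the use of a seeded generator are assumptions made for illustration; the summary only states that \(K\) examples are drawn at random from the training set.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (context, completion)


def draw_demonstrations(
    train_set: Sequence[Example], k: int, rng: random.Random
) -> List[Example]:
    """Sample k distinct training examples to condition on for one evaluation example."""
    return rng.sample(list(train_set), k)
```

Calling this once per evaluation example means each test item is conditioned on its own random set of demonstrations.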

Multiple-Choice Problems

Provide \(K\) examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion.

For most tasks, the per-token likelihood (to normalize for length) is compared. However, sometimes it is beneficial to normalize by the unconditional probability of each completion instead, by computing

\[\frac{P(\texttt{completion}|\texttt{context})}{P(\texttt{completion}|\texttt{answer context})},\]

where answer context is the string "Answer: " or "A: ".
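
Both normalizations are easy to express in log space. The sketch below assumes the per-token natural-log probabilities of each completion have already been obtained from a language model; the function names are hypothetical.

```python
from typing import Dict, Sequence


def per_token_score(cond_logprobs: Sequence[float]) -> float:
    """Mean log-probability per token of the completion given the context."""
    return sum(cond_logprobs) / len(cond_logprobs)


def unconditional_normalized_score(
    cond_logprobs: Sequence[float], uncond_logprobs: Sequence[float]
) -> float:
    """log [ P(completion | context) / P(completion | answer context) ]."""
    return sum(cond_logprobs) - sum(uncond_logprobs)


def pick_answer(scores: Dict[str, float]) -> str:
    """Return the candidate with the highest score."""
    return max(scores, key=scores.get)
```

The two scores can disagree: a completion that is likely under any context is penalized by the unconditional normalization but not by the per-token one.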

Binary Classification

Give the options more semantically meaningful names (e.g. "True" or "False" rather than 0 or 1) and then treat the task like multiple choice.
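
A minimal sketch of this reduction: each label is mapped to a name, every name is scored as a candidate completion, and the best-scoring one wins. `score_fn` is a hypothetical stand-in for whatever likelihood-based scorer is used in the multiple-choice setting.

```python
from typing import Callable, Sequence


def classify_binary(
    context: str,
    score_fn: Callable[[str, str], float],
    label_names: Sequence[str] = ("False", "True"),
) -> int:
    """Return the label index (0 or 1) whose name the scorer prefers as a completion."""
    scores = [score_fn(context, name) for name in label_names]
    return max(range(len(label_names)), key=lambda i: scores[i])
```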

Free-Form Completion

Use beam search with a beam width of 4 and a length penalty of \(\alpha = 0.6\).
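
The summary does not say which length-penalty formula is meant. The sketch below assumes the widely used GNMT-style penalty, \(((5 + |Y|)/6)^{\alpha}\), which divides a hypothesis's total log-probability so that longer outputs are not unfairly disfavored.

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT-style length penalty (an assumed formula, see the note above)."""
    return ((5.0 + length) / 6.0) ** alpha


def rescore(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Length-normalized score used to rank finished beam hypotheses."""
    return log_prob / length_penalty(length, alpha)
```

Under this formula, a 10-token hypothesis with total log-probability -4.5 outranks a 4-token hypothesis with -4.0, even though its raw log-probability is lower. With \(\alpha = 0\) the penalty is 1 and ranking falls back to raw log-probability.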

Language Modeling, Cloze, and Completion

Language Modeling

Evaluate zero-shot GPT-3 by computing its perplexity on the Penn Tree Bank dataset. GPT-3 sets a new SOTA on this dataset, surpassing GPT-2.
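
Perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming the per-token natural-log probabilities have already been produced by the model:

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model that assigns probability 1 to every token has perplexity 1; lower is better.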

LAMBADA

Task: The model is asked to predict the last word of a sentence, which requires reading a paragraph of context.

The authors use a fill-in-the-blank format to guide GPT-3 to predict a word rather than other valid continuations of the paragraph:

Alice was friends with Bob. Alice went to visit her friend ___. → Bob
George bought some baseball equipment, a ball, a glove, and a ___. →

Results:

GPT-3 achieves new SOTA on LAMBADA.
Few-shot performance improves strongly with model size.
The one-shot setting always performs worse than the zero-shot setting.

HellaSwag

Task: Pick the best ending to a story or set of instructions.

Results: The performance of GPT-3 on this task is a fair amount lower than the overall SOTA.

StoryCloze

Task: Select the correct ending sentence for a five-sentence long story.

Results: GPT-3 is better than previous zero-shot results but still underperforms fine-tuned SOTA.

Question Answering

open-book QA: use an information retrieval system to find relevant text and train a model to generate an answer given the question and the retrieved text.

closed-book QA: train a model to answer the questions directly.

Results:

Overall, on one of the three datasets GPT-3’s one-shot matches the open-book fine-tuning SOTA.
On the other two datasets, it approaches the performance of the closed-book SOTA despite not using fine-tuning.

Translation

For GPT-2, a filter was used on a multilingual collection of documents to produce an English-only dataset due to capacity concerns. Since capacity increases by over two orders of magnitude from GPT-2 to GPT-3, the scope of the training dataset is also expanded to include more representation of other languages. The majority of the data is derived from raw Common Crawl with only quality-based filtering. Although GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages.

Compared with prior unsupervised NMT work, zero-shot GPT-3 underperforms, one-shot GPT-3 is nearly competitive, and few-shot GPT-3 achieves similar average performance.

GPT-3 has a noticeable skew in its performance depending on language direction. GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction.

Winograd-Style Tasks

Task: Determine which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human.

Common Sense Reasoning

Task: Capture physical or scientific reasoning.

Results: Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks.

Reading Comprehension

Results:

A wide spread is observed in GPT-3’s performance across 5 datasets, suggesting varying capability with different answer formats.
In general, GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.

SuperGLUE

Results: The average performance of few-shot GPT-3 matches that of a fine-tuned BERT model.

Natural Language Inference

NLI concerns the ability to understand the relationship between two sentences.

Task: A two- or three-class classification problem in which the model classifies whether the second sentence logically follows from the first, contradicts the first, or is possibly true (neutral).

Results: NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.

Synthetic and Qualitative Tasks

Arithmetic

Results: Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even zero-shot settings.

Word Scrambling and Manipulation Tasks

Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word.
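
Two illustrative distortions of this kind are sketched below: rotating the letters of a word, and shuffling its interior letters while keeping the first and last in place. These cover character rearrangement only, and the function names and exact task definitions are assumptions for illustration.

```python
import random


def cycle_letters(word: str, shift: int) -> str:
    """Rotate the word left by `shift` characters."""
    if not word:
        return word
    shift %= len(word)
    return word[shift:] + word[:shift]


def anagram_keep_ends(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters, leaving the first and last letters fixed."""
    if len(word) <= 3:
        return word  # at most one interior letter, so nothing to shuffle
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]
```

For example, `cycle_letters("inevitably", 8)` yields `"lyinevitab"`, and the model's task is to recover `"inevitably"`.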

Results:

The one-shot performance is significantly weaker than the few-shot setting.
In the zero-shot setting, the model can rarely perform any of the tasks. This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data.

News Article Generation

Few-shot learning: Provide three previous news articles and the title and subtitle of a proposed next article in the model’s context to condition it.

Results:

With a prompt, the model is able to reliably generate short articles in the “news” genre.
Humans’ ability to detect model-generated text appears to decrease as model size increases.

Learning and Using Novel Words

Task: using a word in a sentence after seeing it defined only once.

Results: Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.

Correcting English Grammar

Prompt: Poor English Input: <sentence>\n Good English Output: <sentence>.

Limitations

Despite the strong quantitative and qualitative improvements of GPT-3, it still has notable weaknesses in text synthesis and several NLP tasks.
GPT-3 has several structural and algorithmic limitations: it does not include any bidirectional architectures or other training objectives such as denoising.
Scaling up any LM-like model may eventually run into (or could already be running into) the limits of the pretraining objective.
Poor sample efficiency during pre-training.
Ambiguity about whether few-shot learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it has learned during training.