Training language models to follow instructions with human feedback



Making LMs bigger does not inherently make them better at following a user’s intent; in this sense, larger models are not necessarily better aligned with their users.

The model may generate outputs that are untruthful, toxic, or not helpful.

This is because the LM objective used for many recent large LMs, i.e., predicting the next token on a webpage from the internet, is different from the objective “follow the user’s instructions helpfully and safely”.


Main Findings:


High-Level Methodology

  1. Collect demonstration data, and train a supervised policy (supervised fine-tune, SFT).
  2. Collect comparison data, and train a reward model (RM).
  3. Optimize a policy against the reward model using PPO.

Steps 2 and 3 can be iterated continuously.
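The three steps above can be sketched as a control-flow skeleton. All helper names and return values here are hypothetical stand-ins for illustration, not the authors’ training code; the point is only the staging and the fact that steps 2 and 3 loop.

```python
# Control-flow sketch of the SFT -> RM -> PPO pipeline.
# All helpers are trivial stand-ins (hypothetical names).

def supervised_finetune(lm, demonstrations):
    # Step 1: train a supervised policy on labeler demonstrations.
    return {"base": lm, "stage": "sft"}

def train_reward_model(comparisons):
    # Step 2: fit a reward model on human rankings of model outputs.
    return {"stage": "rm", "n_comparisons": len(comparisons)}

def ppo_optimize(policy, reward_model, prompts):
    # Step 3: optimize the policy against the RM with PPO.
    return {**policy, "stage": "ppo"}

def rlhf_pipeline(demonstrations, prompts, n_iterations=1):
    policy = supervised_finetune("pretrained-lm", demonstrations)  # Step 1
    for _ in range(n_iterations):  # Steps 2 and 3 can be iterated
        comparisons = [(p, "ranked outputs") for p in prompts]
        reward_model = train_reward_model(comparisons)             # Step 2
        policy = ppo_optimize(policy, reward_model, prompts)       # Step 3
    return policy
```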


Prompt Dataset

For each prompt, the task can be specified directly through a natural-language instruction, indirectly through few-shot examples, or implicitly through a text to be continued.

Train the Very First InstructGPT

To train the very first InstructGPT models, the authors asked labelers to write prompts themselves. This is because an initial source of instruction-like prompts is needed to bootstrap the process and these kinds of prompts weren’t often submitted to the regular GPT-3 models on the API.

Three kinds of prompts were written by the labelers:

  1. Plain: arbitrary tasks, written to ensure diversity.
  2. Few-shot: an instruction together with multiple query/response pairs for that instruction.
  3. User-based: prompts corresponding to use cases stated in waitlist applications to the OpenAI API.

Datasets for Fine-Tuning

Three different datasets used in the fine-tuning procedure are built from the prompt dataset.

  1. SFT: A prompt is sampled from the prompt dataset, and a labeler writes a demonstration answer to it; the SFT model is trained on these demonstrations with supervised learning. (13k prompts)
  2. RM: A prompt and several model outputs are sampled, and a labeler ranks the outputs from best to worst. This data is used to train the reward model. (33k prompts)
  3. PPO: Another prompt dataset from the API, with no human labels. These prompts are the inputs for PPO fine-tuning against the RM. (31k prompts)

Human Data Collection

Hired a team of about 40 labelers.

During training and evaluation, the alignment criteria may come into conflict.




Model size: 6B RMs are used because they save computation, and training a 175B RM could be unstable.

Loss function:

\[L(\theta)=-\frac{1}{\binom{K}{2}}\,\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log\left(\sigma\left(r_\theta(x,y_w)-r_\theta(x,y_l)\right)\right)\right],\]

where \(r_\theta(x,y)\) is the scalar score output by the RM for prompt \(x\) and completion \(y\) with parameters \(\theta\), \(y_w\) is the preferred completion out of the pair \(y_w\) and \(y_l\), \(K\) is the number of completions ranked per prompt (yielding \(\binom{K}{2}\) comparisons), and \(D\) is the dataset of human comparisons.
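As a concrete sketch, the loss for a single prompt with \(K\) ranked completions can be computed as follows. This is plain NumPy for illustration, not the authors’ implementation; `rm_pairwise_loss` is a hypothetical helper name, and the scores stand in for reward-model outputs.

```python
import numpy as np
from itertools import combinations

def rm_pairwise_loss(scores):
    """Pairwise ranking loss for one prompt.

    `scores` are reward-model outputs r_theta(x, y) for K completions,
    ordered from most to least preferred by the labeler.
    """
    K = len(scores)
    pairs = list(combinations(range(K), 2))  # all K-choose-2 (winner, loser) pairs
    losses = []
    for w, l in pairs:
        # sigma(r_w - r_l): probability the RM agrees with the labeler's ranking
        p = 1.0 / (1.0 + np.exp(-(scores[w] - scores[l])))
        losses.append(-np.log(p))
    return sum(losses) / len(pairs)
```

Note that when the RM cannot distinguish two completions (equal scores), each pair contributes \(\log 2\), and the loss shrinks as the score gap between preferred and dispreferred completions grows.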


Fine-tune the SFT model using PPO.

The environment is a bandit environment that presents a random customer prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode.

In addition, a per-token KL penalty from the SFT model is added at each token to mitigate overoptimization of the reward model.
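A minimal sketch of this shaped reward, assuming the policy’s and SFT model’s per-token log-probabilities for the sampled response are available (the function name and the value of the coefficient `beta` are hypothetical):

```python
import numpy as np

def shaped_reward(rm_score, logp_rl, logp_sft, beta=0.02):
    """Reward used in PPO: the RM score minus a per-token KL penalty.

    rm_score: scalar reward-model score for the (prompt, response) pair.
    logp_rl / logp_sft: log-probabilities the current policy and the frozen
    SFT model assign to each generated token of the response.
    """
    # Per-token log-ratio log(pi_RL / pi_SFT); summing gives the sampled
    # KL penalty, which discourages drifting far from the SFT model.
    kl_per_token = np.asarray(logp_rl) - np.asarray(logp_sft)
    return rm_score - beta * kl_per_token.sum()
```

If the policy has not moved from the SFT model (identical log-probs), the penalty is zero and the reward is just the RM score.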

The value function is initialized from the RM.

An improved algorithm: PPO-ptx

PPO-ptx mixes pretraining gradients into the PPO updates, which mitigates performance regressions on public NLP datasets (the “alignment tax”).
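For reference, the combined objective from the paper, where \(\pi^{\mathrm{RL}}_\phi\) is the learned policy, \(\pi^{\mathrm{SFT}}\) is the supervised baseline, \(\beta\) is the KL coefficient, and \(\gamma\) controls the pretraining mix (\(\gamma=0\) recovers plain PPO):

\[\text{objective}(\phi)=\mathbb{E}_{(x,y)\sim D_{\pi^{\mathrm{RL}}_\phi}}\left[r_\theta(x,y)-\beta\log\frac{\pi^{\mathrm{RL}}_\phi(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)}\right]+\gamma\,\mathbb{E}_{x\sim D_{\text{pretrain}}}\left[\log \pi^{\mathrm{RL}}_\phi(x)\right].\]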




Definition of Alignment

Evaluations on API Distribution

The main metric is human preference ratings on a held-out set of prompts from the same source as the training distribution.

Evaluations on Public NLP Datasets

Results on the API Distribution

Results on Public NLP Datasets

Qualitative results


Implications for Alignment Research

This research is part of a broader research program to align AI systems with human intentions.

Lessons for alignment research more generally:

Who Are We Aligning To?

Several factors influence the fine-tuning data, and these ultimately determine what and whom the models are aligned to.


Limitations of Methodology

Limitations of Models

Open Questions