[Paper] Training language models to follow instructions with human feedback (2022)
Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in neural information processing systems 35 (2022): 27730-27744.
Point
- Employs Reinforcement Learning from Human Feedback (RLHF) to fine-tune GPT-3 models, aligning them with human intentions while reducing unintended behaviors like hallucinations and toxicity.
- InstructGPT models outperform GPT-3 in truthfulness and reliability, and generalize well to new tasks such as non-English and coding instructions.
- Highlights the need for diverse stakeholder input and suggests combining RLHF with other methods to improve model alignment and safety.
Background
Language models (LMs) often generate misinformation and toxic or biased content, and this issue cannot be resolved simply by increasing model size. Understanding user intent is crucial for these models. Fine-tuning with human feedback can align the models with user intentions across various tasks.
Large language models (LLMs) frequently exhibit unintended behaviors, such as hallucinations, toxic text generation, and failure to follow user instructions. These behaviors stem from the model’s training objective, which is typically to predict the next token on web data and differs from the goal of “following the user’s instructions helpfully and safely”.
To align LMs, this paper employs Reinforcement Learning from Human Feedback (RLHF) to fine-tune GPT-3 to follow instructions. Human preferences serve as a reward signal for this fine-tuning process.
Methods and experimental details
High-level methodology
- Preparation: Utilize pre-trained language models (GPT-3), prepare a distribution of prompts for alignment, and train human labelers.
- Collect demonstration data and train a supervised policy: Labelers provide demonstrations of the desired behavior on input prompts. The model is fine-tuned on this data using supervised learning.
- Collect comparison data and train a reward model: Labelers compare model outputs and indicate their preferences. A reward model (RM) is trained using these comparisons to predict human-preferred outputs.
- Optimize a policy against the RM using PPO: The RM’s output serves as a scalar reward. The supervised policy (the fine-tuned GPT-3) is further fine-tuned with the PPO algorithm to optimize this reward.
Steps 2 and 3 can be iterated: more comparison data is collected on the current best policy and used to train a new RM and, subsequently, a new policy.
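A high-level sketch of this three-step loop is shown below. The placeholder functions (`train_supervised`, `train_reward_model`, `ppo_finetune`) and the `labelers` interface are hypothetical stand-ins for the stages detailed in the following sections; this illustrates the data flow only, not the paper’s implementation.

```python
# Minimal sketch of the three-stage RLHF pipeline described above.
# All callables and the `labelers` interface are hypothetical placeholders.

def rlhf_pipeline(pretrained_lm, prompts, labelers,
                  train_supervised, train_reward_model, ppo_finetune,
                  n_rounds=1):
    # Step 1: supervised fine-tuning on labeler-written demonstrations.
    demonstrations = labelers.write_demonstrations(prompts)
    policy = train_supervised(pretrained_lm, demonstrations)

    for _ in range(n_rounds):  # steps 2 and 3 can be repeated
        # Step 2: sample several outputs per prompt, collect labeler rankings,
        # and fit a reward model on the resulting comparisons.
        samples = {p: policy.sample(p, k=4) for p in prompts}
        comparisons = labelers.rank(samples)
        reward_model = train_reward_model(policy, comparisons)

        # Step 3: optimize the policy against the reward model with PPO.
        policy = ppo_finetune(policy, reward_model, prompts)

    return policy
```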
Dataset
Source of prompts:
- Consists of text prompts submitted to the OpenAI API, specifically those using an earlier version of InstructGPT models on the Playground interface.
- The paper does not include data from customers using the API in production.
Deduplication and filtering:
- Heuristically deduplicated by checking for prompts that share a long common prefix (a minimal sketch follows this list).
- The number of prompts is limited to 200 per user ID.
- Validation and test sets contain no data from users whose data is in the training set.
- All prompts in the training split were filtered for personally identifiable information (PII).
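A minimal sketch of the prefix-based deduplication and the 200-prompts-per-user cap described above. The exact heuristic is not published, so the prefix length and the `(user_id, prompt)` data layout are assumptions for illustration.

```python
# Hypothetical sketch of prompt deduplication and per-user capping.
from collections import defaultdict

PREFIX_LEN = 200           # "long common prefix" length: an assumption
MAX_PROMPTS_PER_USER = 200  # per-user limit stated in the paper

def dedupe_and_cap(prompts):
    """prompts: iterable of (user_id, prompt_text) pairs."""
    seen_prefixes = set()
    per_user_count = defaultdict(int)
    kept = []
    for user_id, text in prompts:
        prefix = text[:PREFIX_LEN]
        if prefix in seen_prefixes:
            continue  # heuristic duplicate: shares a long prefix with a kept prompt
        if per_user_count[user_id] >= MAX_PROMPTS_PER_USER:
            continue  # respect the per-user limit
        seen_prefixes.add(prefix)
        per_user_count[user_id] += 1
        kept.append((user_id, text))
    return kept
```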
Initial source of prompts: Human-written prompts were used as an initial source of instruction to bootstrap the process.
Datasets for fine-tuning:
- SFT dataset: Labelers’ demonstrations (13k prompts, from the API and labeler-written examples).
- RM dataset: Labeler rankings of model outputs (33k prompts, from the API and labeler-written examples).
- PPO dataset: Inputs for RLHF fine-tuning, without human labels (31k prompts, from the API only).
Use cases: Most of the use cases for prompts submitted to InstructGPT models are generative, rather than classification.
Tasks
Datasets for training tasks
- Sources: The datasets are sourced from prompts written by labelers and those submitted to early versions of InstructGPT models via API.
- Labeler Instructions: Labelers are trained and instructed to write prompts with specific intents or implicit goals in mind to ensure the model aligns with desired behaviors.
- Language: The datasets are predominantly in English (95%). However, the paper also reports the models’ performance in other languages.
Human data collection
Selection of Labelers: A diverse group of labelers was selected to ensure broad demographic representation, with the aims of generating inputs that reflect a wide range of perspectives and of identifying potentially harmful outputs.
Training and Evaluation: Labelers underwent tests designed to measure their performance in labeling according to the set standards. This included their ability to generate diverse prompts and accurately identify harmful content.
Models
Pre-trained GPT-3 models are used as the basis. These models are trained on a broad distribution of Internet data and can be used for various tasks, but initially exhibit poorly characterized behavior. The GPT-3 models are then further trained using three different techniques:
Supervised fine-tuning (SFT)
This method fine-tunes GPT-3 on labeler demonstrations using supervised learning.
- Training details: 16 epochs with a cosine learning rate decay and a residual dropout of 0.2 (a minimal training-loop sketch follows this list).
- Model selection: Based on the model’s RM score on the validation set.
- Finding: Training for more epochs improves both the RM score and human preference ratings, despite some overfitting.
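A minimal PyTorch sketch of this SFT step, assuming a causal LM that returns next-token logits, a dataloader of tokenized demonstrations, and that the residual dropout of 0.2 is already configured inside the model. Everything beyond the 16 epochs and cosine decay mentioned above (optimizer, learning rate, masking convention) is an assumption.

```python
# Sketch of supervised fine-tuning on labeler demonstrations.
import torch
import torch.nn.functional as F

def supervised_finetune(model, dataloader, epochs=16, lr=1e-5, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimizer choice is an assumption
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * len(dataloader))  # cosine learning-rate decay
    for _ in range(epochs):
        for input_ids, labels in dataloader:
            input_ids, labels = input_ids.to(device), labels.to(device)
            logits = model(input_ids)  # assumed shape: [batch, seq, vocab]
            # Standard next-token prediction loss on the demonstration tokens;
            # prompt tokens can be masked out with label -100.
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
                ignore_index=-100)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
    return model
```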
Reward modeling (RM)
- Base model: Starts with a pre-trained SFT model but the final unembedding layer is removed. This layer maps the model’s representations to the vocabulary space for generating output tokens.
- Input and output: The model takes a prompt and a response as input and outputs a scalar reward representing the quality of the response for the given prompt.
- Model size: Uses a 6B reward model (RM) for efficiency; a larger 175B RM was found to be unstable and unsuitable for use as the value function in RL.
- Data: Uses comparisons between two model outputs for the same input to determine which output is preferred by human labelers.
- Loss: Trained with cross-entropy loss, using the comparisons as labels. The reward difference reflects the log odds of one response being preferred over the other by a labeler.
- Speeding up comparison collection: Labelers are presented with $K$ responses to rank for each prompt, where $K$ ranges from 4 to 9. This yields $\frac{K(K-1)}{2}$ comparisons per prompt.
- Training efficiency and overfitting:
- Comparisons within each labeling task are highly correlated. If all comparisons are shuffled into one dataset and processed in a single pass, the model tends to overfit.
- To address this, training treats all $\frac{K(K-1)}{2}$ comparisons from each prompt as a single batch element (see the sketch after the loss function below), which offers several benefits:
- Requires only one forward pass for each set of $K$ responses, instead of $\frac{K(K-1)}{2}$ forward passes.
- Prevents overfitting by avoiding isolated highly correlated comparisons.
- Improves computational efficiency, and achieves better validation accuracy and log loss.
Loss function:
\[\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log\left( \sigma\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right) \right]\]
- $r_\theta(x,y)$ is the scalar output of the RM for prompt $x$ and completion $y$ with parameters $\theta$.
- $y_w$ is the preferred completion out of the pair $y_w$ and $y_l$.
- $D$ is the dataset of human comparisons.
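A minimal PyTorch sketch of this loss, assuming the RM has already produced the $K$ scalar rewards for one prompt in a single forward pass; the function name, data layout, and ranking format are illustrative, not the paper’s code. All $\frac{K(K-1)}{2}$ comparisons from the prompt are handled together, mirroring the single-batch-element trick described above.

```python
# Sketch of the pairwise RM loss over all comparisons from one prompt.
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards, preference_order):
    """
    rewards: tensor of shape [K], scalar RM outputs for K responses to one prompt.
    preference_order: list of K response indices, best first (labeler ranking).
    Returns the mean of -log sigmoid(r_winner - r_loser) over all K(K-1)/2 pairs.
    """
    losses = []
    for i in range(len(preference_order)):
        for j in range(i + 1, len(preference_order)):
            r_w = rewards[preference_order[i]]  # higher-ranked response
            r_l = rewards[preference_order[j]]  # lower-ranked response
            losses.append(-F.logsigmoid(r_w - r_l))
    return torch.stack(losses).mean()

# Usage: rewards for one prompt's K=4 responses, ranked [2, 0, 3, 1] by a labeler.
rewards = torch.tensor([0.3, -1.2, 1.5, 0.1], requires_grad=True)
loss = rm_pairwise_loss(rewards, preference_order=[2, 0, 3, 1])
loss.backward()
```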
Reinforcement learning (RL)
- Base model: The SFT model is fine-tuned using Proximal Policy Optimization (PPO).
- Training environment: A bandit environment. In this context, the environment presents a random customer prompt, expects a response, produces a reward determined by the RM, and ends the episode.
- Input and output: The policy takes a prompt as input and generates a response; the reward for the (prompt, response) pair is determined by the RM.
- KL penalty: A per-token Kullback-Leibler (KL) penalty against the SFT model is added at each token (see the sketch after this list).
- This penalty mitigates over-optimization of the RM and prevents the model from deviating too far from the behavior learned during supervised fine-tuning.
- The value function used in PPO is initialized from the RM.
- PPO and PPO-ptx models:
- PPO models: Fine-tuned with PPO.
- PPO-ptx models: Involve an additional experiment where pre-training gradients are mixed into PPO gradients to address performance regressions on public NLP datasets.
The objective function for PPO-ptx:
\[\begin{aligned} \text{objective}(\phi) = & \ \mathbb{E}_{(x, y) \sim D_{\pi_{\phi}^{RL}}} \left[ r_\theta(x, y) - \beta \log \left( \frac{\pi_\phi^{RL}(y | x)}{\pi^{SFT}(y | x)} \right) \right] \\ & + \gamma \mathbb{E}_{x \sim D_{\text{pretrain}}} \left[ \log(\pi_\phi^{RL}(x)) \right] \end{aligned}\]
where:
- $\pi_\phi^{RL}$ is the learned RL policy and $\pi^{SFT}$ is the supervised fine-tuned model.
- $D_{\pi_\phi^{RL}}$ is the distribution of data under the RL policy, and $D_{\text{pretrain}}$ is the pre-training distribution.
- $\beta$ is the KL reward coefficient, controlling the strength of the KL penalty.
- $\gamma$ is the pre-training loss coefficient, controlling the influence of pre-training gradients. For PPO models $\gamma$ is set to 0.
- In this paper, InstructGPT refers to the PPO-ptx models.
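A minimal sketch of the terms in this objective, under stated assumptions: per-token log-probabilities of the sampled response under the RL policy and the frozen SFT model, plus the RM score, are precomputed; the coefficient defaults are illustrative rather than the paper’s exact settings; and the PPO clipped-surrogate machinery that actually optimizes this objective is omitted.

```python
# Sketch of the PPO-ptx objective value for a single sampled response.
import torch

def ppo_ptx_objective(rm_score, policy_logprobs, sft_logprobs,
                      pretrain_logprobs, beta=0.02, gamma=27.8):
    # Per-token KL penalty against the SFT model, summed over the response tokens.
    kl_penalty = (policy_logprobs - sft_logprobs).sum()
    rl_term = rm_score - beta * kl_penalty
    # Pre-training mix term: log-likelihood of pre-training tokens under the RL
    # policy (setting gamma=0 recovers the plain PPO objective).
    ptx_term = gamma * pretrain_logprobs.sum()
    return rl_term + ptx_term  # quantity to be maximized

# Usage with dummy tensors (shapes and coefficient values are assumptions).
obj = ppo_ptx_objective(
    rm_score=torch.tensor(1.3),
    policy_logprobs=torch.randn(12), sft_logprobs=torch.randn(12),
    pretrain_logprobs=torch.randn(64))
```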
Baselines
The performance of PPO models is compared against several baselines:
- SFT models: Fine-tuned using supervised learning.
- GPT-3: The standard GPT-3 model without additional fine-tuning.
- GPT-3 Prompted: Provided with a few-shot prefix to prompt it into an instruction-following mode, where the prefix is prepended to the user-specified instruction.
- InstructGPT is compared to 175B GPT-3 models fine-tuned on FLAN and T0 datasets. These datasets include various NLP tasks combined with natural language instructions.
Evaluation
The definition of “alignment” to evaluate models is based on their ability to act in accordance with user intentions. The practical evaluation framework checks if the model is helpful, honest and harmless.
- Helpfulness: The model should follow instructions and also infer intent from the prompt or from a few-shot pattern.
- Since the intent behind a prompt can be unclear, labeler preference ratings are used as the main evaluation metric.
- There may be divergence between actual user intentions and labeler interpretations.
- Honesty: Since the model’s output cannot be compared against its actual “belief”, truthfulness is measured instead.
- Two metrics are used:
- The model’s tendency to fabricate information on closed domain tasks
- Performance on the TruthfulQA dataset.
- Harm: Harmfulness depends on the context in which the model is used, and assessing potential harm requires significant speculation.
- More specific proxy criteria are used to capture behavior of a deployed model that could end up being harmful:
- Labelers evaluate if an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content.
- Benchmarks like RealToxicityPrompts and CrowS-Pairs are used to measure bias and toxicity.
Evaluation on API distribution
When using prompts from the API for evaluating human preference ratings, only prompts not included in training are selected.
Since prompts submitted to InstructGPT models are not suitable for the GPT-3 baselines, prompts submitted to the GPT-3 API are also used for evaluation.
- The GPT-3 prompts are not in an instruction-following style.
- The 175B SFT model is chosen as the baseline due to its average performance.
Each model is evaluated based on how often its outputs are preferred, and labelers judge the overall quality of each response on a 1-7 Likert scale.
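A small sketch of these two head-to-head metrics (win rate against the baseline and mean Likert score), assuming a simple list of per-comparison labeler judgments; the record format is hypothetical.

```python
# Sketch of the preference win rate and mean 1-7 Likert score.

def winrate_and_likert(records):
    """records: list of dicts like
    {"preferred": "model" or "baseline", "likert_model": int in 1..7}."""
    wins = sum(r["preferred"] == "model" for r in records)
    winrate = wins / len(records)
    mean_likert = sum(r["likert_model"] for r in records) / len(records)
    return winrate, mean_likert

# Usage with toy judgments from three labelers.
records = [
    {"preferred": "model", "likert_model": 6},
    {"preferred": "baseline", "likert_model": 4},
    {"preferred": "model", "likert_model": 5},
]
print(winrate_and_likert(records))  # (0.666..., 5.0)
```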
Evaluation on public NLP datasets
Two types of public datasets are used:
- Safety evaluation: Focuses on truthfulness, toxicity, and bias. Includes evaluations of toxicity using the RealToxicityPrompts dataset.
- Zero-shot performance: Assesses performance on traditional NLP tasks such as question answering (QA), reading comprehension, and summarization.
Results
The experimental results are organized into three parts: results on the API prompt distribution, results on public NLP datasets, and qualitative results.
Results on the API distribution
1. Labelers significantly prefer InstructGPT outputs over outputs from GPT-3.
- 175B InstructGPT outputs are preferred to GPT-3 outputs around 85% of the time, and around 71% of the time against few-shot GPT-3.
- The preference order is GPT-3 < GPT-3 Prompted < SFT < PPO.
- Mixing in pre-training updates during PPO (PPO-ptx) does not lead to significant changes in labeler preference.
- This preference trend remains consistent when evaluating models on prompts submitted to GPT-3 models on the API, though PPO-ptx models perform slightly worse at larger sizes.
- InstructGPT outputs are rated favorably on more concrete axes: they follow explicit constraints and instructions better and hallucinate less.
- This suggests that InstructGPT models are more reliable and easier to control than GPT-3.
2. InstructGPT models generalize to the preferences of “held-out” labelers that did not produce any training data.
- InstructGPT models’ outputs are rated better than GPT-3 baselines by held-out labelers, indicating that InstructGPT models are not simply overfitting to the preferences of training labelers.
- RMs also demonstrate generalization capability in cross-validation: 69.6% accuracy in predicting the preferences of held-out labelers, slightly lower than the 72.4% accuracy in predicting preferences within the training set.
3. Public NLP datasets are not reflective of how the LMs are used.
- The 175B GPT-3 baselines fine-tuned on FLAN and T0 perform better than GPT-3 with a good prompt, but worse than the SFT baseline. This suggests these datasets are not sufficiently diverse to improve performance on the API prompt distribution.
- InstructGPT may outperform FLAN and T0 because:
- Public NLP datasets are designed to capture tasks that are easy to evaluate (e.g., classification, QA). However, open-ended generation and brainstorming constitute most (57%) of the tasks API users want.
- Public NLP datasets may lack the high diversity of inputs that real-world users are interested in.
Results on public NLP datasets
1. InstructGPT models show improvements in truthfulness over GPT-3.
- PPO models demonstrate significant improvements on the TruthfulQA dataset.
- The 1.3B PPO-ptx model performs slightly worse than GPT-3 of the same size.
- Training with an “Instruction+QA” prompt helps the model avoid generating false information.
- Instruction+QA: Instructs the model to respond with “I have no comment” when it’s uncertain of the correct answer.
2. InstructGPT shows small improvements in toxicity over GPT-3, but not bias.
- Toxicity: Evaluated using the RealToxicityPrompts benchmark.
- Evaluation method: Model samples are scored for toxicity via the Perspective API, and labelers also rate their toxicity.
- InstructGPT outputs are less toxic than those of GPT-3 when instructed to generate respectful outputs. Without any prompt, the models are similar, and InstructGPT can be more toxic when prompted to produce toxic content.
- Bias: Evaluated using the Winogender and CrowS-Pairs benchmarks.
- Evaluation method: Calculates the relative probabilities of producing the sentences in each pair and the entropy of the associated binary probability distribution (a small sketch follows this results list).
- Unbiased models will show no preference, thus having maximum entropy.
- InstructGPT and GPT-3 show similar levels of bias. The PPO-ptx model shows higher bias when instructed to act respectfully, with unclear patterns.
- Instructed models tend to be more certain of their outputs, regardless of whether those outputs follow stereotypes.
3. Modifying RLHF fine-tuning procedures can minimize performance regressions on public NLP datasets.
- Alignment tax: PPO models experience a decrease in performance on public NLP datasets, referred to as an “alignment tax”.
- Mitigation strategies: Mixing pre-training updates into PPO fine-tuning (PPO-ptx) reduces performance regressions across all datasets.
- PPO-ptx performs better than merely increasing the KL coefficient. Changing the KL model from the PPO initialization to GPT-3 yields similar improvements.
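A small sketch of the bias metric referenced above (item 2): given a model’s log-probabilities for the two sentences of a Winogender or CrowS-Pairs pair, normalize them into a binary distribution and compute its entropy; maximum entropy (1 bit) means no preference, i.e. no measured bias. The function name and inputs are illustrative assumptions.

```python
# Sketch of the pairwise-entropy bias measure.
import math

def pair_entropy(logprob_a, logprob_b):
    # Relative probability of sentence A within the pair (softmax over two options).
    p_a = 1.0 / (1.0 + math.exp(logprob_b - logprob_a))
    p_b = 1.0 - p_a
    return -sum(p * math.log2(p) for p in (p_a, p_b) if p > 0.0)

# Usage: a strong preference yields low entropy; near-equal log-probs give ~1 bit.
print(pair_entropy(-12.0, -15.0))  # confident preference -> low entropy
print(pair_entropy(-12.0, -12.1))  # near indifference -> close to 1.0
```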
Qualitative results
1. InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.
- InstructGPT models can follow non-English instructions, and perform coding tasks, despite limited training data in these formats.
- Alignment methods can generalize to produce desired behaviors on inputs not directly supervised.
- The 175B PPO-ptx model can answer questions about code and follow non-English instructions, but often responds in English to questions asked in other languages.
2. InstructGPT still makes simple mistakes.
- The model sometimes incorrectly assumes a false premise in an instruction is true.
- It can overly hedge even when the answer is clear.
- It struggles to generate responses when an instruction contains multiple or otherwise challenging constraints.
Discussion
Implications for alignment research
Improving the alignment of current AI systems provides a clear empirical feedback loop, essential for refining alignment techniques.
Moreover, RLHF is an important building block for aligning superhuman systems, especially for tasks difficult to evaluate.
General lessons for alignment research:
- The cost of increasing model alignment is modest relative to pre-training: the main costs are data collection and training compute, which are small compared to pre-training. With RLHF, larger LMs become more helpful, suggesting that investing in aligning existing LMs is more cost-effective than training new, larger models.
- There is evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in: E.g., non-English and code tasks. This is important as creating supervised models for each task is expensive.
- The proposed fine-tuning can mitigate most of the performance degradations: Low alignment tax techniques are needed for future AI systems capable of understanding human intents, and RLHF is effective in this regard.
- Alignment techniques are validated in the real world: This work grounds alignment research in real-world applications, providing valuable insights for AI systems used by actual users.
Who are we aligning to?
Factors influencing the fine-tuning data and key sources of alignment preferences:
- Labelers’ preferences: The models are aligned to the preferences of hired labelers who generate the training data. They are mostly English speakers, with around 73% agreement among them.
- Researchers’ preferences: Researchers design the study, write instructions, and guide labelers on edge cases, thereby influencing the alignment. More research is needed to understand the impact of different instructions and interfaces on the collected data and model behavior.
- Customer prompts: Training data includes prompts from OpenAI customers using the API. There is potential misalignment between customer goals and end-user well-being.
- Customer representation: The customers are not representative of all potential or current LM users. The initial user base was biased towards OpenAI’s networks.
Challenges and future directions:
- Designing a fair and transparent alignment process is complex.
- This paper demonstrates that the alignment method can work for a specific human reference group but doesn’t claim these group preferences are ideal.
- Multiple stakeholders need consideration, including model trainers, developers, end-users, and the broader impacted population.
- Aligning a system to everyone’s preferences simultaneously is impossible, and not all trade-offs will be universally endorsed.
- One potential approach is to train models that can be conditioned on the preferences of different groups so that they reflect diverse values. However, this may still impact broader society, raising difficult decisions about which preferences to prioritize.
Limitations
Methodology:
- Contractor influence: InstructGPT is influenced by the human feedback from about 40 contractors.
- Contractors’ identity, beliefs, cultural backgrounds, and personal history may affect their judgments.
- They were selected based on their performance with sensitive prompts and labeling tasks.
- The small team size allowed for better communication but is not representative of the broader population that will use the models.
- They are mostly English-speaking, and the data is almost entirely in English.
- Data collection improvements: Most comparisons are labeled by only one contractor to reduce costs.
- Multiple labelings could help identify disagreement areas, indicating where a single model may not align with all labelers.
- Averaging labeler preferences for disagreements might not be ideal, especially for minority groups, whose preferences should be weighted more heavily.
Models:
- Incomplete alignment and safety: InstructGPT is not fully aligned or safe.
- It still generates toxic or biased outputs, misinformation, and sexual or violent content.
- It sometimes fails to generate reasonable outputs for certain inputs.
- Following potentially harmful instructions: InstructGPT often follows instructions even if it could lead to real-world harm.
- It produces more toxic outputs than GPT-3 when instructed to be maximally biased.