Coffee Chat Brewing AI Knowledge


Long Short-term Memory (LSTM)

LSTM is an architecture created to solve the long-term dependency problem of recurrent neural networks (RNNs).

The core idea is to add computations that decide how much of the previous time step's state information is carried into later states. To this end, three gates (forget, input, and output) and a memory cell are added.

To compare with how LSTM computes its states, recall the RNN update equations:

\[\begin{align*} h_t&=\tau(Wh_{t-1}+Ux_t) \\ \hat{y}_t&=softmax(Vh_t) \\ \end{align*}\]

The hidden state of the previous time step ($t-1$) and the input of the current time step ($t$), each multiplied by its weight matrix, are summed and passed through tanh ($\tau$) to produce the current hidden state. Applying softmax to it gives the output.


LSTM equations

LSTM adds a memory mechanism, inspired by the residual connection structure, on top of the RNN formulation to address the long-term dependency problem. The computations added for this memory mechanism are as follows.

Forget gate

The forget gate $f$ decides how much of the previous time step's information to forget.

\[f_t=\sigma(W_fh_{t-1}+U_fx_t)\]

The previous hidden state and the current input (each multiplied by its weight matrix) are summed and passed through a sigmoid. The result is multiplied with the previous memory cell state; because of the sigmoid's range, values close to 1 let more of the previous information carry forward.

Input gate

The input gate $i$ decides how much of the current time step's input to carry into the next time step. Here the notion of a candidate $\hat{C}$ appears: the candidate is a proposed value for the cell state that summarizes the current information given the previous hidden state and the current input, and it is computed in exactly the same way as the RNN hidden state. The input gate value is multiplied with the candidate to determine how much of the current input should be reflected, and this product is then added to the cell state.

\[\begin{align} i_t&=\sigma(W_{in}h_{t-1}+U_{in}x_{t})\\ \hat{C}_t&=\tau(W_{c}h_{t-1}+U_{c}x_t) \end{align}\]

Memory cell

The memory cell (cell state) is the component added, together with the three gates, to serve LSTM's purpose. The current cell state is computed from the previous cell state scaled by the forget gate, plus the candidate scaled by the input gate.

\[C_t=f_t*C_{t-1}+i_t*\hat{C}_t\]

where $*$ denotes a pointwise (element-wise) operation.

The retained portion of the previous cell state and the admitted portion of the current input are summed to produce the current cell state.

Output gate

The output gate decides how much of the memory cell to reflect in the current hidden state.

\[\begin{align} o_t&=\sigma(W_oh_{t-1}+U_ox_t) \\ h_t&=o_t*\tau(C_t) \\ &=o_t*\tau(f_t*C_{t-1}+i_t*\hat{C}_t) \\ \end{align}\]

The current hidden state is obtained by multiplying the output gate value with the (tanh of the) current cell state, which already reflects both the previous information and the current input.

Output

The final output $\hat{y}_t$ is computed as in the RNN:

\[\hat{y}_t=softmax(Vh_t)\]
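
To make these equations concrete, here is a minimal NumPy sketch of a single LSTM step, following the formulation above (biases are omitted to match the equations, and the weight names mirror the notation in this post rather than any particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step. W_* matrices act on h_{t-1}, U_* matrices act on x_t."""
    f_t   = sigmoid(p["W_f"]  @ h_prev + p["U_f"]  @ x_t)   # forget gate
    i_t   = sigmoid(p["W_in"] @ h_prev + p["U_in"] @ x_t)   # input gate
    c_hat = np.tanh(p["W_c"]  @ h_prev + p["U_c"]  @ x_t)   # candidate
    c_t   = f_t * c_prev + i_t * c_hat                      # memory cell update
    o_t   = sigmoid(p["W_o"]  @ h_prev + p["U_o"]  @ x_t)   # output gate
    h_t   = o_t * np.tanh(c_t)                              # hidden state
    return h_t, c_t

# Toy usage: hidden size 4, input size 3.
rng = np.random.default_rng(0)
H, D = 4, 3
p = {k: rng.normal(size=(H, H if k.startswith("W") else D))
     for k in ["W_f", "U_f", "W_in", "U_in", "W_c", "U_c", "W_o", "U_o"]}
h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    h, c = lstm_step(rng.normal(size=D), h, c, p)
```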


Limitations of LSTM

LSTM introduces the cell state to mitigate the vanishing gradient problem. However, as long as it is built on the recurrent structure of RNNs, it cannot solve this problem completely. On top of that, the use of multiple gates increases the amount of computation.

[Paper] Alpaca: A Strong, Replicable Instruction-Following Model

Paper Link

Points

  • Alpaca aims to support academic research on instruction-following large language models (LLMs), addressing deficiencies like hallucinations, toxicity, and biases.
  • Uses the self-instruct approach to create an instruction-following dataset with text-davinci-003, costing under $500.
  • The LLaMA 7B model is fine-tuned using efficient techniques.


Background

LLMs trained through instruction-following, such as ChatGPT, have significantly impacted daily life. However, these models still face issues like generating misinformation, toxic content, and exhibiting social biases. To address these problems, academic research is essential. Closed-source models hinder this research, making it difficult to study instruction-following models.

Alpaca is a model designed for academic research, fine-tuned from the LLaMA 7B model using 52k instruction-following examples generated from OpenAI’s text-davinci-003. Commercial use of Alpaca is prohibited for the following reasons:

  • Non-commercial license: LLaMA
  • Data restrictions: Based on text-davinci-003 prohibiting competition with OpenAI
  • Deployment caution: Not designed with adequate safety measures for general use.


Training Recipe

To train a high-quality instruction-following model under an academic budget, two key challenges are addressed:

  1. Strong pre-trained language model: LLaMA models
  2. High-quality instruction-following data: Self-instruct method

Self-instruct method

  • Seed set: 175 human-written instruction-output pairs from the self-instruct seed set.
  • Data generation: Prompting text-davinci-003 to generate more instructions using the seed set as examples.
  • Efficiency: Improved the self-instruct method, generating 52k unique instructions and outputs for less than $500 using the OpenAI API.

fig1

Fine-tuning the model

  • Process: LLaMA models are fine-tuned with the generated instruction-following dataset using fully sharded data parallel (FSDP) and mixed precision training.
  • Cost and time: Fine-tuning a 7B LLaMA model took 3 hours on eight 80GB A100s, costing less than $100 on most cloud compute providers.


Preliminary Evaluation

Human evaluation was conducted on inputs from the self-instruct evaluation set. Key findings include:

  • Comparison: Alpaca 7B vs. text-davinci-003
  • Performance: Alpaca wins 90 comparisons versus 89 for text-davinci-003.
    • Given Alpaca’s smaller size and limited data, it performed similarly to text-davinci-003.
  • Generation style: Alpaca’s outputs tend to be similar to those of text-davinci-003 and reflect the general style of the training dataset.
  • Evaluation limitation: The evaluation data’s limitations should be noted.
  • An interactive demo was released to gather further feedback.


Known Limitations

Alpaca shares common deficiencies with LLMs, such as hallucinations, toxicity, and stereotypes. It struggles particularly with hallucination, sometimes producing well-written misinformation. Despite these issues, Alpaca provides a lightweight model for studying these deficiencies, aiding academic research.


Release

Released assets:

  • Demo: Interactive demo for evaluation
  • Data: 52k demonstrations used to fine-tune Alpaca
  • Data generation process: Code for generating the data
  • Training code: Fine-tuning code using Hugging Face API

Future release:

  • Model weights: Pending guidance from Meta

The release aims to support academic studies of instruction-following LMs and the development of new techniques to address their existing deficiencies.


[Paper] Llama: Open and efficient foundation language models (2023)

Touvron, Hugo, et al. “Llama: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).

Paper Link

Points

  • Efficient inference with smaller models: LLaMA models prioritize inference efficiency by using smaller models trained on large datasets, achieving state-of-the-art (SOTA) performance across benchmarks while being cost-effective during inference.
  • Publicly available data: Unlike many existing models that rely on proprietary data, LLaMA models are trained exclusively on publicly available datasets, ensuring transparency and compatibility with open-source principles.
  • Broad Benchmark Performance: LLaMA models demonstrate competitive performance on a wide range of tasks, including common sense reasoning, question answering, and reading comprehension.


Background

Large language models (LLMs) have demonstrated remarkable capabilities in performing new tasks with minimal instruction or examples, thanks to their vast size. However, recent research suggests that smaller models trained on larger datasets can achieve superior performance, highlighting the importance of efficiency during inference rather than training.


Approach

LLaMA is a series of language models (LMs) designed to optimize performance across various inference budgets, ranging from 7B to 65B parameters, using only publicly available data.

Pre-training data

The dataset mixture covers diverse domains and is entirely publicly available, ensuring compatibility with open-source principles:

  1. English CommonCrawl [67%]: Preprocessed from five CommonCrawl dumps (2017-2020), filtered for non-English and low-quality content.
  2. C4 [15%]: Preprocessed similarly to CommonCrawl; including diverse preprocessed CommonCrawl data enhances performance.
  3. Github [4.5%]: Filtered for line length and alphanumeric content from Google BigQuery.
  4. Wikipedia [4.5%]: Dumps from mid-2022, covering multiple languages.
  5. Gutenberg and Books3 [4.5%]: Publicly available books with redundant content removed.
  6. ArXiv [2.5%]: Includes scientific data, with non-essential content removed.
  7. Stack Exchange [2%]: High-quality Q&A content sorted by score.

Tokenization

  • Byte Pair Encoding (BPE) tokenizer used.
  • Splits numbers into digits and decomposes unknown UTF-8 characters.
  • The training dataset contains approximately 1.4T tokens, with minimal repetition (fig 1).

    fig1

Architecture

LLaMA models are based on transformer architecture with key modifications:

  1. Pre-normalization [GPT3]: Normalizes the input of each transformer sub-layer, enhancing training stability using RMSNorm.
  2. SwiGLU activation function [PaLM]: Uses SwiGLU instead of ReLU, improving performance, with a hidden dimension of $\frac{2}{3}4d$ instead of the $4d$ used in PaLM (see the sketch after this list).
  3. Rotary Embeddings [GPTNeo]: Employs Rotary embeddings (RoPE) instead of absolute positional embeddings at each layer of the network.
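
As a minimal sketch of the first two modifications, the block below implements RMSNorm pre-normalization and a SwiGLU feed-forward layer in PyTorch. The layer names, the exact rounding of the $\frac{2}{3}4d$ hidden size, and the toy dimensions are illustrative choices, and rotary embeddings are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization, used for pre-normalization of each sub-layer."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """SwiGLU feed-forward block: SiLU-gated linear unit with a ~(2/3)*4d hidden size."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)          # 2/3 * 4d instead of 4d
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(1, 8, 512)                      # (batch, seq, dim)
y = SwiGLUFeedForward(512)(RMSNorm(512)(x))     # pre-norm, then SwiGLU FFN
print(y.shape)                                  # torch.Size([1, 8, 512])
```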

Optimizer

Trained using the AdamW optimizer with:

  • $\beta_1=0.9, \beta_2=0.95$.
  • Cosine learning rate schedule, ending at 10% of the maximal rate.
  • Weight decay of 0.1 and gradient clipping of 1.0.
  • 2,000 warmup steps, with the learning rate and batch size varying with the size of the model (table 2); a sketch of this schedule follows below.

    table2
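
A minimal PyTorch sketch of this optimizer configuration, assuming a linear warmup into the cosine schedule. The maximal learning rate, total step count, and toy model below are placeholders, since the actual learning rates and batch sizes vary with model size per table 2:

```python
import math
import torch

model = torch.nn.Linear(512, 512)               # placeholder model
max_lr, warmup_steps, total_steps = 3e-4, 2000, 100_000
min_lr_ratio = 0.1                              # end at 10% of the maximal rate

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay down to min_lr_ratio * max_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr_ratio + (1.0 - min_lr_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping of 1.0
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```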

Efficient implementation

  1. Causal multi-head attention: Efficient implementation using the xformers library to reduce memory usage and runtime.
  2. Activation checkpointing: Saves activations that are expensive to compute (e.g., the outputs of linear layers) to reduce the amount of activation recomputation during the backward pass.


Main Results

Evaluated on 20 benchmarks for zero-shot and few-shot tasks, compared to non-public models (GPT-3, Gopher, Chinchilla, PaLM) and open-sourced models (OPT, GPT-J, GPT-Neo).

Common sense reasoning

table3

  • Benchmarks: Eight standard benchmarks: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, and OpenBookQA. These datasets include Cloze and Winograd style tasks and multiple choice question answering (QA).
  • Results
    • LLaMA-65B outperforms Chinchilla 70B and PaLM-540B on most benchmarks except BoolQ.
    • LLaMA-13B outperforms GPT-3 on most benchmarks despite being significantly smaller.

Closed-book question answering

table4 table5

  • Benchmarks: Natural Questions and TriviaQA. Exact match performance is reported in the closed-book setting, where the models do not have access to documents that contain evidence for answering the question.
  • Results:
    • LLaMA-65B achieves state-of-the-art (SOTA) performance in zero-shot and few-shot settings.
    • LLaMA-13B is competitive with GPT-3 and Chinchilla which are larger models.

Reading comprehension

table6

  • Benchmark: RACE reading comprehension, collected from English reading comprehension exams designed for Chinese middle and high school students.
  • Results: LLaMA-65B is competitive with PaLM-540B, and LLaMA-13B outperforms GPT-3.

Mathematical reasoning

table7

  • Benchmarks: MATH and GSM8k. MATH contains 12K middle and high school math problems. GSM8k is a set of middle school math problems.
  • Results: LLaMA-65B outperforms Minerva-62B on GSM8k.
    • Minerva is a series of PaLM models fine-tuned on 38.5B tokens extracted from ArXiv and Math Web Pages. Neither PaLM nor LLaMA, however, is fine-tuned on math data.

Code generation

table8

  • Benchmarks: HumanEval and MBPP. The models are evaluated on their ability to write code from a natural language description.
  • Results:
    • LLaMA models outperform other models, including LaMDA and PaLM. LLaMA-13B outperforms LaMDA-137B. LLaMA 65B outperforms PaLM-62B.
    • Fine-tuning on code-specific tokens further improves performance.

Massive multitask language understanding

table9

  • Massive multitask language understanding (MMLU) consists of multiple choice questions covering various domains of knowledge, like humanities, STEM and social sciences.
  • Results: LLaMA-65B underperforms compared to Chinchilla-70B and PaLM-540B, possibly due to limited academic data.

Evolution of performance during training

fig2

  • Performance improves steadily, and correlates with the training perplexity of the model.
  • SIQA and WinoGrande are the exceptions: results on SIQA may not be reliable because performance fluctuates, and on WinoGrande performance does not correlate with training perplexity.


Instruction Fine-tuning

Fine-tuning on instruction data improves performance and, further, the ability to follow instructions. LLaMA-I is LLaMA fine-tuned with instructions, evaluated on MMLU, and compared with OPT-IML and the Flan-PaLM series, which are instruction fine-tuned models of moderate size.

table10

  • LLaMA-I with 65B parameter size outperforms existing instruction fine-tuned models, but remains behind GPT ‘code-davinci-002’.


Bias, Toxicity and Misinformation

LLMs have been shown to reproduce biases present in their training data and to generate toxic content. LLaMA is evaluated using benchmarks for toxic content generation and stereotype detection.

RealToxicityPrompts

Measures how toxic a model’s generations are. The toxicity score is evaluated automatically by making a request to PerspectiveAPI, and ranges from 0 (non-toxic) to 1 (toxic).

table11

  • Comparable to other models, with larger models exhibiting more toxicity, especially for “Respectful” prompts.
  • It can be suggested that the relation between toxicity and model size may only apply within a model family.

CrowS-Pairs

Evaluates the biases in a model across 9 categories: gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status.

table12

  • LLaMA exhibits some bias, particularly in the religion, age, and gender categories. This may come from the CommonCrawl dataset.

WinoGender

Used to investigate the bias of a model on the gender category. It evaluates if the model’s co-reference resolution performance is impacted by the gender of the pronoun.

table13

  • Performance varies by pronoun type: The models perform better on “their/them/someone” pronouns than on the “her/her/she” and “his/him/he” pronouns.
  • Larger models show more gender bias: For “gotcha” cases, LLaMA-65B makes more errors, showing that it captures gender biases.
    • “gotcha” cases are those in which the pronoun does not match the majority gender of the occupation, and the occupation is the correct answer.

TruthfulQA

Evaluates a model’s ability to identify true claims and measures the risk of generating misinformation or false claims. This assesses the truthfulness of a model’s responses.

table14

  • LLaMA models show better truthfulness compared to GPT-3. However, the rate of correct answers remains low, indicating a continued potential for misinformation.


Carbon footprint

Details the environmental impact of training and deploying these models.

table15


[Paper] Training language models to follow instructions with human feedback (2022)

Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in neural information processing systems 35 (2022): 27730-27744.

Paper Link

Point

  • Employs Reinforcement Learning from Human Feedback (RLHF) to fine-tune GPT-3 models, aligning them with human intentions while reducing unintended behaviors like hallucinations and toxicity.
  • InstructGPT models outperform GPT-3 in truthfulness and reliability, generalizing well to new tasks such as non-English and coding instructions.
  • Highlights the need for diverse stakeholder input and suggests combining RLHF with other methods to improve model alignment and safety.


Background

Language models (LMs) often generate misinformation and toxic or biased content, and this issue cannot be resolved simply by increasing model size. Understanding user intent is crucial for these models. Fine-tuning with human feedback can align the models with user intentions across various tasks.

Large language models (LLMs) frequently exhibit unintended behaviors, such as hallucinations, toxic text generation, and failure to follow user instructions. These are influenced by the model’s objective, which typically involves predicting the next token of web data and which differs from the goal of “following the user’s instructions helpfully and safely”.

To align LMs, this paper employs Reinforcement Learning from Human Feedback (RLHF) to fine-tune GPT-3 to follow instructions. Human preferences serve as a reward signal for this fine-tuning process.


Methods and experimental details

fig2

High-level methodology

  1. Preparation: Utilize pre-trained language models (GPT-3), prepare a distribution of prompts for alignment, and train human labelers.
  2. Collect demonstration data and train a supervised policy: Labelers provide demonstrations of the desired behavior on the input prompts. The model is fine-tuned on this data using supervised learning.
  3. Collect comparison data and train a reward model: Labelers compare model outputs and indicate their preferences. A reward model (RM) is trained using these comparisons to predict human-preferred outputs.
  4. Optimize a policy against the RM using PPO: The RM’s output serves as a scalar reward. The supervised policy (the fine-tuned GPT-3) is further fine-tuned with the PPO algorithm to optimize this reward.

Steps 2 and 3 can be iterated: more comparison data is collected on the current best policy and used to train a new RM and subsequently a new policy.


Dataset

Source of prompts:

  • Consists of text prompts submitted to the OpenAI API, specifically those using an earlier version of InstructGPT models on the Playground interface.
  • The paper does not include data from customers using the API in production.

Deduplication and filtering:

  • Heuristically deduplicated by checking for prompts that share a long common prefix.
  • The number of prompts is limited to 200 per user ID.
  • Validation and test sets contain no data from users whose data is in the training set.
  • All prompts in the training split were filtered for personally identifiable information (PII).

Initial source of prompts: Human-written prompts were used as an initial source of instruction to bootstrap the process.

Datasets for fine-tuning:

  • SFT dataset: Labelers’ demonstrations (13k prompts, from the API and labeler-written examples).
  • RM dataset: Labeler rankings of model outputs (33k, from the API and labeler-written examples).
  • PPO dataset: Inputs for RLHF fine-tuning. Human labels were not used (31k, only from the API).

Use cases: Most of the use cases of prompts submitted to InstructGPT models are generative rather than classification tasks.

fig1


Tasks

Datasets for training tasks

  • Sources: The datasets are sourced from prompts written by labelers and those submitted to early versions of InstructGPT models via API.
  • Labeler Instructions: Labelers are trained and instructed to write prompts with specific intents or implicit goals in mind to ensure the model aligns with desired behaviors.
  • Language: The datasets are predominantly in English (95%). However, the paper also reports the models’ performance in other languages.


Human data collection

Selection of Labelers: A diverse group of labelers was selected to ensure a broad demographic representation. It aims to generate inputs with a wide range of perspectives and to identify potentially harmful outputs.

Training and Evaluation: Labelers underwent tests designed to measure their performance in labeling according to the set standards. This included their ability to generate diverse prompts and accurately identify harmful content.


Models

Pre-trained GPT-3 models are used as the basis. These models are trained on a broad distribution of Internet data and can be used for various tasks, but initially exhibit poorly characterized behavior. The GPT-3 models are then further trained using three different techniques:


Supervised fine-tuning (SFT)

This method fine-tunes GPT-3 on labeler demonstrations using supervised learning.

  • Training details: 16 epochs using a cosine learning rate decay and a residual dropout of 0.2.
  • Model selection: Based on the model’s RM score on the validation set.
  • Finding: Training for more epochs improves both the RM score and human preference ratings, despite some overfitting.


Reward modeling (RM)

  • Base model: Starts with a pre-trained SFT model but the final unembedding layer is removed. This layer maps the model’s representations to the vocabulary space for generating output tokens.
  • Input and output: The model takes a prompt and a response as input and outputs a scalar reward representing the quality of the response for the given prompt.
  • Model size: Utilizes a 6B reward model (RM) for efficiency. A larger 175B RM was found to be unstable and unsuitable for use as the value function in RL.
  • Data: Uses comparisons between two model outputs for the same input to determine which output is preferred by human labelers.
  • Loss: Trained with cross-entropy loss, using the comparisons as labels. The reward difference reflects the log odds of one response being preferred over the other by a labeler.
  • Speed-up comparison collection: Labelers are presented with $K$ responses to rank for each prompt, where $K$ ranges from 4 to 9. This results in $K(K-1) \over 2$ comparisons for each prompt.
  • Training efficiency and overfitting:
    • Comparisons within each labeling task are very correlated. If all comparisons are shuffled into one dataset and processed in a single pass, the model tends to overfit.
    • To address this, the training treats all $K(K-1) \over 2$ comparisons from each prompt as a single batch element, offering several benefits:
      • Requires only one forward pass for each set of $K$ responses, instead of $K(K-1) \over 2$ forward passes.
      • Prevents overfitting by avoiding isolated highly correlated comparisons.
      • Improves computational efficiency, and achieves better validation accuracy and log loss.
  • Loss function (a minimal sketch follows this list):

    \[loss(\theta)=-{1\over \binom{K}{2}} \mathbb{E}_{(x,y_w,y_l)\sim D}[\log(\sigma(r_\theta(x,y_w)-r_\theta(x,y_l)))]\]
    • $r_\theta(x,y)$ is the scalar output of the RM for prompt $x$ and completion $y$ with parameters $\theta$.
    • $y_w$ is the preferred completion of the pair $(y_w, y_l)$.
    • $D$ is the dataset of human comparisons.
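
As a concrete reading of this loss, here is a minimal PyTorch sketch. It assumes the reward model has already produced scalar rewards for each preferred/rejected pair of the same prompt; the function name and tensor shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise RM loss for one prompt.

    r_preferred, r_rejected: rewards r_theta(x, y_w) and r_theta(x, y_l) for all
    K(K-1)/2 comparisons of the same prompt, each of shape (num_pairs,).
    All comparisons from one prompt are treated as a single batch element, as described above.
    """
    # -log sigmoid(r_w - r_l), averaged over the K(K-1)/2 comparisons
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy usage: K = 4 responses for a prompt -> 6 comparisons.
r_w = torch.randn(6, requires_grad=True)
r_l = torch.randn(6)
loss = rm_pairwise_loss(r_w, r_l)
loss.backward()
```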


Reinforcement learning (RL)

  • Base model: The SFT model is fine-tuned using Proximal Policy Optimization (PPO) in an environment.
  • Training environment: A bandit environment. In this context, the bandit environment presents a random customer prompt, expects a response, produces a reward determined by the RM, and ends the episode.
  • Input and output: The policy takes the prompt as input and generates a response, which then receives a reward determined by the RM.
  • KL penalty: A per-token Kullback-Leibler (KL) penalty is added from the SFT model at each token.
    • This penalty mitigates over-optimization of the RM and prevents the model from deviating too far from the behavior learned during supervised fine-tuning.
    • The value function used in PPO is initialized from the RM.
  • PPO and PPO-ptx models:
    • PPO models: Fine-tuned with PPO.
    • PPO-ptx models: Involve an additional experiment in which pre-training gradients are mixed into the PPO gradients to address performance regressions on public NLP datasets.
    • The objective function for PPO-ptx (a minimal sketch follows this list):

      \[\begin{aligned} \text{objective}(\phi) = & \ \mathbb{E}_{(x, y) \sim D_{\pi_{\phi}^{RL}}} \left[ r_\theta(x, y) - \beta \log \left( \frac{\pi_\phi^{RL}(y | x)}{\pi^{SFT}(y | x)} \right) \right] \\ & + \gamma \mathbb{E}_{x \sim D_{\text{pretrain}}} \left[ \log(\pi_\phi^{RL}(x)) \right] \end{aligned}\]

      where:

      • $\pi_\phi^{RL}$ is the learned RL policy and $\pi^{SFT}$ is the supervised fine-tuned model.
      • $D_{\pi^{RL}}$ is the distribution of data under the RL policy, and $D_{pretrain}$ is the pre-training distribution.
      • $\beta$ is the KL reward coefficient, controlling the strength of the KL penalty.
      • $\gamma$ is the pre-training loss coefficient, controlling the influence of pre-training gradients. For PPO models $\gamma$ is set to 0.
  • In this paper, InstructGPT refers to the PPO-ptx models.
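
To show how the terms of the objective combine, here is a minimal sketch. It assumes the relevant log-probabilities and rewards have already been computed for a batch; the function and argument names are illustrative, and PPO’s clipping and advantage machinery are omitted:

```python
import torch

def ppo_ptx_objective(reward, logp_rl, logp_sft, logp_pretrain, beta, gamma):
    """Objective(phi) from the formula above, to be maximized.

    reward:        r_theta(x, y) from the reward model, shape (batch,)
    logp_rl:       log pi_phi^RL(y | x) under the current policy, shape (batch,)
    logp_sft:      log pi^SFT(y | x) under the frozen SFT model, shape (batch,)
    logp_pretrain: log pi_phi^RL(x) on pre-training samples, shape (ptx_batch,)
    beta, gamma:   KL reward coefficient and pre-training loss coefficient
                   (gamma = 0 recovers the plain PPO objective).
    """
    kl_penalty = beta * (logp_rl - logp_sft)        # per-sample KL penalty vs. the SFT model
    rl_term = (reward - kl_penalty).mean()          # expectation over the RL policy's data
    ptx_term = gamma * logp_pretrain.mean()         # pre-training log-likelihood term
    return rl_term + ptx_term

# In training, one would maximize this objective (e.g., minimize its negative),
# with PPO handling the policy-gradient updates; this sketch only mirrors the formula.
```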


Baselines

The performance of PPO models is compared against several baselines:

  • SFT models: Fine-tuned using supervised learning.
  • GPT-3: The standard GPT-3 model without additional fine-tuning.
  • GPT-3 Prompted: Provided with a few-shot prefix to prompt it into an instruction-following mode, where the prefix is prepended to the user-specified instruction.
  • InstructGPT is compared to 175B GPT-3 models fine-tuned on FLAN and T0 datasets. These datasets include various NLP tasks combined with natural language instructions.


Evaluation

The definition of “alignment” to evaluate models is based on their ability to act in accordance with user intentions. The practical evaluation framework checks if the model is helpful, honest and harmless.

  • Helpfulness: The model should follow instructions and infer intent from a few-shot prompt or other interpretable patterns.
    • Since the intention could be unclear, labeler preference ratings are considered mainly for evaluation.
    • There may be divergence between actual user intentions and labeler interpretations.
  • Honesty: Truthfulness is measured instead of comparing the model’s output to its actual belief.
    • Two metrics are used:
      • The model’s tendency to fabricate information on closed domain tasks
      • Performance on the TruthfulQA dataset.
  • Harm: Harmfulness depends on the context in which the model is used, and assessing potential harm requires significant speculation.
    • More specific proxy criteria are used:
      • Whether a deployed model could be harmful.
      • Labelers evaluate if an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content.
      • Benchmarks like RealToxicityPrompts and CrowS-pairs are used to measure bias and toxicity.


Evaluation on API distribution

When using prompts from the API to evaluate human preference ratings, only prompts not included in training are selected.

Since prompts for InstructGPT models are not suitable for the GPT-3 baselines, prompts submitted to the GPT-3 API are also used for evaluation.

  • The GPT-3 prompts are not in an instruction-following style.
  • The 175B SFT model is chosen as the baseline due to its average performance.

Each model is evaluated based on how often its outputs are preferred, and labelers judge the overall quality of each response on a 1-7 Likert scale.


Evaluation on public NLP datasets

Two types of public datasets are used:

  • Safety evaluation: Focuses on truthfulness, toxicity, and bias. Includes evaluations of toxicity using the RealToxicityPrompts dataset.
  • Zero-shot performance: Assesses performance on traditional NLP tasks such as question answering (QA), reading comprehension, and summarization.


Results

The experimental results are organized into three parts: results on the API prompt distribution, results on public NLP datasets, and qualitative results.

Results on the API distribution

1. Labelers significantly prefer InstructGPT outputs over outputs from GPT-3.

fig1

  • 175B InstructGPT outputs are preferred to GPT-3 outputs around 85% of the time and around 71% compared to few-shot GPT-3.
  • The preference order is GPT-3 < GPT-3 Prompted < SFT < PPO.
  • Adding updates on the pre-training mix during PPO does not lead to significant changes in labeler preference.

fig3

  • This preference trend remains consistent when evaluating models on prompts submitted to GPT-3 models on the API, though PPO-ptx models perform slightly worse at larger sizes.

fig4

  • InstructGPT outputs are rated favorably on more concrete axes: They follow explicit constraints and instructions better and hallucinate less.
  • This suggests that InstructGPT models are more reliable and easier to control than GPT-3.


2. InstructGPT models generalize to the preferences of “held-out” labelers that did not produce any training data.

  • InstructGPT models’ outputs are rated better than GPT-3 baselines by held-out labelers, indicating that InstructGPT models are not simply overfitting to the preferences of the training labelers.
  • RMs also demonstrate generalization capability in cross-validation: 69.6% accuracy in predicting the preferences of held-out labelers, slightly lower than the 72.4% accuracy in predicting preferences within the training set.


3. Public NLP datasets are not reflective of how the LMs are used.

fig5

  • When comparing InstructGPT to the 175B GPT-3 baselines fine-tuned on FLAN and T0, those models perform better than GPT-3 with a good prompt but worse than the SFT baseline. This suggests the datasets are not sufficiently diverse to improve performance on the API prompt distribution.
  • InstructGPT may outperform FLAN and T0 because:
    • Public NLP datasets are designed to capture typical tasks that are easy to evaluate (e.g., classification, QA). However, open-ended generation and brainstorming constitute most (57%) of the tasks API users want.
    • Public NLP datasets may lack the high diversity of inputs that real-world users are interested in.

Results on public NLP datasets

1. InstructGPT models show improvements in truthfulness over GPT-3.

fig6

  • PPO models demonstrate significant improvements on the TruthfulQA dataset.
  • The 1.3B PPO-ptx model performs slightly worse than GPT-3 of the same size.
  • Training with an “Instruction+QA” prompt helps the model avoid generating false information.
    • Instruction+QA: Instructs the model to respond with “I have no comment” when it’s uncertain of the correct answer.


2. InstructGPT shows small improvements in toxicity over GPT-3, but not bias.

fig7

  • Toxicity: Evaluated using the RealToxicityPrompts benchmark.
    • Evaluation method: Toxicity scores are obtained through the Perspective API with model samples and labelers rate the samples.
    • InstructGPT outputs are less toxic than those of GPT-3 when instructed to generate respectful outputs. Without any prompt, the models are similar, and InstructGPT can be more toxic when prompted to produce toxic content.
  • Bias: Evaluated using the Winogender and CrowS-Pairs benchmarks.
    • Evaluation method: Calculates the relative probabilities of producing sentences in each pair and the entropy of the associated binary probability distributions.
      • Unbiased models will show no preference, thus having maximum entropy.
    • InstructGPT and GPT-3 show similar levels of bias. The PPO-ptx model shows higher bias when instructed to act respectfully, with unclear patterns.
    • Instructed models tend to be more certain of their outputs, regardless of whether those outputs reflect stereotypes.


3. Modifying RLHF fine-tuning procedures can minimize performance regressions on public NLP datasets.

  • Alignment tax: PPO models experience a decrease in performance on public NLP datasets, referred to as the “alignment tax”.

fig28 fig29

  • Mitigation strategies: Mixing pre-training updates to the PPO fine-tuning (PPO-ptx) reduces performance regressions across all datasets.

fig33

  • PPO-ptx performs better than merely increasing the KL coefficient. Changing the KL model from the PPO initialization to GPT-3 yields similar improvements.


Qualitative results

1. InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.

  • InstructGPT models can follow non-English instructions, and perform coding tasks, despite limited training data in these formats.
  • Alignment methods can generalize to produce desired behaviors on inputs not directly supervised.

fig8

  • 175B PPO-ptx model can answer questions about code and non-English instructions, but often responds in English to questions in other languages.


2. InstructGPT still makes simple mistakes.

fig9

  • The model sometimes incorrectly assumes a false premise in an instruction is true.
  • It can overly hedge even when the answer is clear.
  • It struggles to generate responses when there are multiple or challenging constraints in an instruction.


Discussion

Implications for alignment research

Improving the alignment of current AI systems provides a clear empirical feedback loop, essential for refining alignment techniques.

Moreover, RLHF is an important building block for aligning superhuman systems, especially for tasks difficult to evaluate.

General lessons for alignment research:

  • The cost of increasing model alignment is modest relative to pre-training: The significant costs lie in data collection and computation. With RLHF, larger LMs become more helpful, suggesting investing in aligning existing LMs is more efficient than training new, larger models.
  • There is evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in: E.g., non-English and code tasks. This is important as creating supervised models for each task is expensive.
  • The proposed fine-tuning can mitigate most of the performance degradations: Low alignment tax techniques are needed for future AI systems capable of understanding human intents, and RLHF is effective in this regard.
  • Alignment techniques are validated in the real world: This work grounds alignment research in real-world applications, providing valuable insights for AI systems used by actual users.


Who are we aligning to?

Factors influencing the fine-tuning data and key sources of alignment preferences:

  • Labelers’ preferences: The models are aligned to the preferences of hired labelers who generate the training data. They are mostly English speakers, with around 73% agreement among them.
  • Researchers’ preferences: Researchers design the study, write instructions, and guide labelers on edge cases, thereby influencing the alignment. More research is needed to understand the impact of different instructions and interfaces on the collected data and model behavior.
  • Customer prompts: Training data includes prompts from OpenAI customers using the API. There is potential misalignment between customer goals and end-user well-being.
  • Customer representation: The customers are not representative of all potential or current LM users. The initial user base was biased towards OpenAI’s networks.

Challenges and future directions:

  • Designing a fair and transparent alignment process is complex.
  • This paper demonstrates that the alignment method can work for a specific human reference group but doesn’t claim these group preferences are ideal.
  • Multiple stakeholders need consideration, including model trainers, developers, end-users, and the broader impacted population.
  • Aligning a system to everyone’s preferences simultaneously is impossible, and not all trade-offs will be universally endorsed.
  • One potential approach is to train models for different group preferences so that they reflect diverse values. However, this may still impact broader society, raising difficult decisions about which preferences to prioritize.


Limitations

Methodology:

  • Contractor influence: InstructGPT is influenced by the human feedback from about 40 contractors.
    • Contractors’ identity, beliefs, cultural backgrounds, and personal history may affect their judgments.
    • They were selected based on their performance with sensitive prompts and labeling tasks.
    • The small team size allowed for better communication but is not representative of the broader population that will use the models.
    • They are mostly English-speaking, and the data is almost entirely in English.
  • Data collection improvements: Most comparisons are labeled by only one contractor to reduce costs.
    • Multiple labelings could help identify disagreement areas, indicating where a single model may not align with all labelers.
    • Averaging labeler preferences for disagreements might not be ideal, especially for minority groups, whose preferences should be weighted more heavily.

Models:

  • Incomplete alignment and safety: InstructGPT is not fully aligned or safe.
    • It still generates toxic or biased outputs, misinformation, and sexual or violent content.
    • It sometimes fails to generate reasonable outputs for certain inputs.
  • Following potentially harmful instructions: InstructGPT often follows instructions even if it could lead to real-world harm.
    • It produces more toxic outputs than GPT-3 when instructed to be maximally biased.


[Paper] Deep learning for image super-resolution: A survey (2020)

Wang, Zhihao, Jian Chen, and Steven CH Hoi. “Deep learning for image super-resolution: A survey.” IEEE transactions on pattern analysis and machine intelligence 43.10 (2020): 3365-3387.

Paper Link

Introduction

  • Super-resolution (SR) is the process of enhancing the resolution of images, transforming low-resolution (LR) images to high-resolution (HR) images.
  • SR is an ill-posed problem due to the existence of multiple HR images for a single LR image.
  • Deep learning has significantly advanced SR, with approaches like CNNs (SRCNN) and GANs (SRGAN).


Problem Setting and Terminology

  • Problem Definition: Developing a super-resolution model to approximate HR images from LR inputs.
  • Image Quality Assessment (IQA): Methods include subjective human perception and objective computational techniques, classified into full-reference, reduced-reference, and no-reference methods.


Supervised Super-Resolution

SR Framework

  • Pre-Upsampling Framework: Uses traditional upsampling followed by deep neural networks (e.g., SRCNN).
  • Post-Upsampling Framework: Employs end-to-end deep learning models for upsampling.
  • Progressive Upsampling Framework: Utilizes cascades of CNNs for step-by-step refinement of images.
  • Iterative Up-and-Down Sampling: Incorporates methods like DBPN and SRFBN for capturing LR-HR dependencies.

Upsampling Methods

Interpolation-Based

Includes nearest-neighbor, bilinear, and bicubic interpolation. These are traditional techniques used to resize images before the advent of deep learning-based methods.

Learning-Based

Utilizes transposed convolution layers and sub-pixel layers for end-to-end learning.

  1. Transposed Convolution Layer (Deconvolution Layer): Performs the inverse of a convolution in terms of shape, predicting a larger output from feature maps sized like a convolution’s output; in practice it expands the image by inserting zeros between pixels and then performing a convolution (a sketch of both learning-based layers follows this list).

    deconv

    • This method enlarges the image size while maintaining a connectivity pattern, but it can cause uneven overlapping on each axis, leading to checkerboard-like artifacts that can affect SR performance.
  2. Sub-Pixel Layer (PixelShuffle): Generates a multiple of the channels by convolution and then reshapes them into spatial resolution.

    subpixel

    • Given an input size $(h \times w \times c)$, it generates $s^2$ times the channels, where $s$ is the scaling factor. The output size becomes $(h \times w \times s^2c)$, which is then reshaped (shuffled) to $(sh \times sw \times c)$.
    • This method maintains a larger receptive field than the transposed convolution layer, providing more contextual and realistic details. However, the distribution of the receptive field can be uneven, leading to artifacts near the boundaries of different blocks.
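
A minimal PyTorch sketch of the two learning-based upsampling layers described above. The channel count, kernel sizes, and scaling factor are illustrative choices, not values from the survey:

```python
import torch
import torch.nn as nn

s, c = 2, 64                                  # scaling factor and feature channels

# Transposed convolution: inserts zeros between pixels, then convolves.
# kernel_size=4, stride=2, padding=1 exactly doubles the spatial size.
deconv_up = nn.ConvTranspose2d(c, c, kernel_size=4, stride=s, padding=1)

# Sub-pixel layer: a convolution producing s^2 * c channels,
# then PixelShuffle rearranges (c*s^2, h, w) -> (c, s*h, s*w).
subpixel_up = nn.Sequential(
    nn.Conv2d(c, c * s * s, kernel_size=3, padding=1),
    nn.PixelShuffle(s),
)

x = torch.randn(1, c, 32, 32)                 # (batch, channels, h, w)
print(deconv_up(x).shape)                     # torch.Size([1, 64, 64, 64])
print(subpixel_up(x).shape)                   # torch.Size([1, 64, 64, 64])
```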

Network Design

networks

Residual Learning

  • Simplifies learning by focusing on the residuals between LR and HR images instead of learning a direct mapping. This approach reduces the complexity of the transformation task.
  • By learning only the difference (residuals) between the input and the target image, the model can focus on fine details, resulting in better performance and faster convergence.
  • Example: The ResNet architecture uses residual blocks to enhance the ability of very deep networks to learn effectively without vanishing gradients.

Recursive Learning

  • Repeatedly applies the same modules to capture higher-level features while maintaining a manageable number of parameters.
  • Allows the network to refine features iteratively, leading to more detailed and accurate image reconstructions.
  • Example: Deep Recursive Convolutional Network (DRCN) utilizes a single convolutional layer applied multiple times to expand the receptive field without increasing the number of parameters significantly.

Multi-Path Learning

  1. Local Multi-Path Learning
    • Extracts features through multiple parallel paths which then get fused to provide better modeling capabilities. This approach helps in capturing different aspects of the image simultaneously.
    • Different paths can focus on various scales or types of features, which are then combined to improve the overall representation.
    • Example: Multi-scale Residual Network (MSRN) uses multiple convolutional layers with different kernel sizes to capture multi-scale features.
  2. Scale-Specific Multi-Path Learning
    • Involves having separate paths for different scaling factors within a single network, allowing the network to handle multiple scales more effectively.
    • Example: MDSR (Multi-Scale Deep Super-Resolution) shares most network parameters but has scale-specific layers to handle different upscaling factors.

Dense Connections

  • Enhances gradient flow and feature reuse by connecting each layer to every other layer in a feed-forward fashion. This ensures that gradients can flow directly to earlier layers, improving learning efficiency.
  • Promotes feature reuse, leading to more efficient and compact networks.
  • Example: DenseNet connects each layer to every other layer, facilitating better feature propagation and reducing the risk of gradient vanishing.

Group Convolution

  • Splits the input channels into groups and performs convolutions within each group. This reduces the computational complexity and number of parameters.
  • Often used in lightweight models to balance performance and efficiency.
  • Example: Xception and MobileNet architectures use depthwise separable convolutions, a type of group convolution, to reduce the number of parameters and computation.

Pyramid Pooling

  • Uses pooling operations at multiple scales to capture both global and local context information. This helps in understanding the image at different resolutions.
  • Example: PSPNet (Pyramid Scene Parsing Network) uses pyramid pooling to aggregate contextual information from different scales, which is then combined to enhance the feature representation.

Attention Mechanisms

  1. Channel Attention
    • Focuses on the interdependencies between feature channels. It assigns different weights to different channels, enhancing important features and suppressing less useful ones.
    • Example: Squeeze-and-Excitation Networks (SENet) use a squeeze operation to aggregate feature maps across spatial dimensions, followed by an excitation operation that recalibrates channel-wise feature responses (see the sketch after this list).
  2. Spatial Attention
    • Focuses on the spatial location of important features. It assigns weights to different spatial regions, allowing the model to focus on relevant areas of the image.
    • Example: Convolutional Block Attention Module (CBAM) combines channel and spatial attention to improve representation by focusing on meaningful parts of the image.
  3. Non-Local Attention
    • Captures long-range dependencies between distant pixels. This is particularly useful for super-resolution tasks where global context is important.
    • Example: Non-local Neural Networks use a self-attention mechanism to compute relationships between all pairs of positions in the feature map, allowing the model to capture global context and dependencies.
  4. Combined Attention
    • Some networks combine multiple types of attention mechanisms to leverage the strengths of each. For instance, combining channel and spatial attention can provide a more comprehensive attention mechanism.
    • Example: The Residual Channel Attention Network (RCAN) uses channel attention modules within a residual network structure to enhance the network’s ability to capture important features for image super-resolution.
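
To illustrate channel attention, here is a minimal Squeeze-and-Excitation style block in PyTorch. The reduction ratio and layer sizes are common illustrative choices rather than specifics from the survey; RCAN places similar modules inside residual blocks:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-Excitation style channel attention (illustrative sketch).

    Squeeze: global average pooling reduces each channel to a single value.
    Excitation: a bottleneck MLP produces per-channel weights in (0, 1),
    which rescale the input feature maps.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # per-channel weights
        return x * w                                           # reweight the channels

x = torch.randn(2, 64, 48, 48)
print(ChannelAttention(64)(x).shape)          # torch.Size([2, 64, 48, 48])
```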

Learning Strategies

  • Loss Functions: Early methods used pixel-wise L2 loss, while newer approaches incorporate more complex losses like content loss, adversarial loss, and perceptual loss to improve the quality of the reconstructed images.
  • Training Techniques: Techniques such as curriculum learning, multi-supervision, and progressive learning are used to enhance the training process and improve model performance.


Unsupervised Super-Resolution

Unsupervised methods do not rely on paired LR-HR datasets. Instead, they use generative models and adversarial training to learn the mapping from LR to HR images. Techniques include CycleGAN, which learns the transformation by mapping LR images to HR images and vice versa.

Domain-Specific Super-Resolution

Domain-specific methods focus on specific applications such as face SR, text SR, and medical image SR. These methods leverage domain knowledge to improve the quality of SR in specific contexts.

Benchmark Datasets and Performance Evaluation

Several benchmark datasets are used for evaluating SR models, including Set5, Set14, BSD100, and Urban100. Common evaluation metrics include Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM).

Metrics

While PSNR is widely used, it does not always correlate well with human perception of image quality. SSIM addresses this by considering luminance, contrast, and structure.

  • PSNR is one of the most popular reconstruction quality measurements of lossy transformation. In the context of SR, it’s defined via the maximum pixel value ($L$) and the mean squared error (MSE) between the images.

    \[PSNR=10\cdot\log_{10}\big({L^2\over{1\over N}\sum_{i=1}^N(I(i)-\hat{I}(i))^2}\big)\]
    • $I(i)$ and $\hat{I}(i)$ represent the pixel values of the original and reconstructed images, respectively, and $N$ is the total number of pixels.
  • SSIM measures the structural similarity between images based on independent comparisons of luminance, contrast, and structure. It reflects the fact that the human visual system (HVS) is highly adapted to extracting image structures (a computational sketch of both metrics follows this list).

    \[SSIM(I,\hat{I})={(2\mu_I\mu_{\hat{I}}+C_1)(2\sigma_{I\hat{I}}+C_2)\over(\mu_I^2+\mu_{\hat{I}}^2+C_1)(\sigma_I^2+\sigma_\hat{I}^2+C_2)}\]
    • $\mu_I$ and $\mu_{\hat{I}}$ are the mean pixel values of the original and reconstructed images, respectively. $\sigma_I^2$ and $\sigma_{\hat{I}}^2$ are the variances, and $\sigma_{I\hat{I}}$ is the covariance of $I$ and $\hat{I}$. $C_1$ and $C_2$ are constants that stabilize the division when the denominators are small.
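
A minimal NumPy sketch of both metrics as defined above. The SSIM here is computed once over the whole image using the conventional k1/k2 constants; practical implementations average the same formula over local windows:

```python
import numpy as np

def psnr(img, ref, max_val=255.0):
    """Peak signal-to-noise ratio between a reconstructed image and its reference."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(img, ref, max_val=255.0, k1=0.01, k2=0.03):
    """SSIM computed over the whole image, mirroring the formula above."""
    img, ref = img.astype(np.float64), ref.astype(np.float64)
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_i, mu_r = img.mean(), ref.mean()
    var_i, var_r = img.var(), ref.var()
    cov = ((img - mu_i) * (ref - mu_r)).mean()
    return ((2 * mu_i * mu_r + c1) * (2 * cov + c2)) / (
        (mu_i**2 + mu_r**2 + c1) * (var_i + var_r + c2))

# Toy usage: compare a noisy reconstruction against its reference.
rng = np.random.default_rng(0)
hr = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
sr = np.clip(hr + rng.normal(0, 5.0, size=hr.shape), 0, 255)
print(psnr(sr, hr), ssim_global(sr, hr))
```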

Challenges and Future Directions

  • Scalability: Developing SR models that can handle varying scales and resolutions efficiently.
  • Real-World Applications: Enhancing SR models to perform well on real-world images with diverse degradation.
  • Efficiency: Reducing computational complexity and memory usage while maintaining high performance.
  • Generality: Creating SR models that generalize well across different types of images and domains.
  • Perceptual Quality: Improving the perceptual quality of SR images, ensuring that they are visually appealing and free from artifacts.

Conclusion

The survey paper provides an in-depth review of deep learning-based super-resolution techniques, categorizing them into supervised, unsupervised, and domain-specific methods. It discusses various network architectures, upsampling techniques, and learning strategies, highlighting the advancements and challenges in the field. The paper also covers benchmark datasets and performance evaluation metrics, providing a comprehensive overview of the current state of image super-resolution research.