[Paper] LLaMA: Open and Efficient Foundation Language Models (2023)
Touvron, Hugo, et al. “Llama: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).
Points
- Efficient inference with smaller models: LLaMA models prioritize inference efficiency by using smaller models trained on large datasets, achieving state-of-the-art (SOTA) performance across benchmarks while being cost-effective during inference.
- Publicly available data: Unlike many existing models that rely on proprietary data, LLaMA models are trained exclusively on publicly available datasets, ensuring transparency and compatibility with open-source principles.
- Broad benchmark performance: LLaMA models demonstrate competitive performance on a wide range of tasks, including common sense reasoning, question answering, and reading comprehension.
Background
Large language models (LLMs) have demonstrated remarkable capabilities in performing new tasks with minimal instruction or examples, thanks to their vast size. However, recent research suggests that smaller models trained on larger datasets can achieve superior performance, highlighting the importance of efficiency during inference rather than training.
Approach
LLaMA is a series of language models (LMs) designed to optimize performance across various inference budgets, ranging from 7B to 65B parameters, using only publicly available data.
Pre-training data
The dataset mixture covers diverse domains and is entirely publicly available, ensuring compatibility with open-source principles:
- English CommonCrawl [67%]: Preprocessed from five CommonCrawl dumps (2017-2020), filtered for non-English and low-quality content.
- C4 [15%]: Preprocessed similarly to CommonCrawl and included to enhance performance.
- GitHub [4.5%]: Public repositories from Google BigQuery, filtered by line length and alphanumeric content.
- Wikipedia [4.5%]: Dumps from mid-2022, covering multiple languages.
- Gutenberg and Books3 [4.5%]: Publicly available books with redundant content removed.
- ArXiv [2.5%]: Includes scientific data, with non-essential content removed.
- Stack Exchange [2%]: High-quality Q&A content sorted by score.
Tokenization
- Byte Pair Encoding (BPE) tokenizer used.
- Splits numbers into individual digits and falls back to bytes to decompose unknown UTF-8 characters.
- The training dataset contains approximately 1.4T tokens after tokenization, with minimal repetition (fig 1); a tokenizer training sketch follows this list.
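To make these choices concrete, the sketch below trains such a tokenizer with the SentencePiece library, which provides BPE with `split_digits` and `byte_fallback` options; the corpus path and the 32k vocabulary size are illustrative assumptions rather than details taken from these notes.

```python
# Illustrative BPE tokenizer training with digit splitting and byte fallback.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical plain-text training corpus
    model_prefix="llama_bpe",  # writes llama_bpe.model / llama_bpe.vocab
    model_type="bpe",
    vocab_size=32000,          # assumed vocabulary size
    split_digits=True,         # split numbers into individual digits
    byte_fallback=True,        # decompose unknown UTF-8 characters into bytes
)

sp = spm.SentencePieceProcessor(model_file="llama_bpe.model")
print(sp.encode("Trained on 1.4T tokens in 2023.", out_type=str))
```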
Architecture
LLaMA models are based on the transformer architecture, with three key modifications (sketched in code after the list):
- Pre-normalization [GPT3]: Normalizes the input of each transformer sub-layer with RMSNorm, improving training stability.
- SwiGLU activation function [PaLM]: Replaces ReLU with SwiGLU to improve performance, using a hidden dimension of $\frac{2}{3}4d$ instead of the $4d$ used in PaLM.
- Rotary Embeddings [GPTNeo]: Employs Rotary embeddings (RoPE) instead of absolute positional embeddings at each layer of the network.
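A minimal PyTorch sketch of these three modifications is given below; the module names, shapes, and the rotate-half RoPE formulation are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (not the released code): RMSNorm pre-normalization,
# a SwiGLU feed-forward block with hidden size ~ (2/3) * 4d, and rotary
# position embeddings applied to query/key vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root mean square instead of mean and variance.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 / 3 * 4 * dim)  # 2/3 * 4d instead of 4d
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(x W_gate) * (x W_up), projected back to dim.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Apply RoPE to a tensor of shape (..., seq_len, head_dim) by rotating
    # channel pairs with position-dependent angles (rotate-half form).
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In a pre-normalized block, RMSNorm is applied to the input of the attention and feed-forward sub-layers, with the residual connection added afterwards, and `rotary_embedding` is applied to the query and key vectors rather than adding absolute position embeddings to the token embeddings.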
Optimizer
Trained using the AdamW optimizer with:
- $\beta_1=0.9, \beta_2=0.95$.
- Cosine learning rate schedule, ending at 10% of the maximal rate.
- Weight decay of 0.1 and gradient clipping of 1.0.
- 2,000 warmup steps, with the learning rate and batch size varying with model size (table 2); an optimizer configuration sketch follows this list.
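As a rough illustration, the sketch below wires this configuration up in PyTorch; the peak learning rate, total step count, and stand-in model are placeholders rather than values from table 2.

```python
# Illustrative optimizer setup: AdamW (beta1=0.9, beta2=0.95, weight decay
# 0.1), gradient clipping at 1.0, and a cosine schedule with 2,000 warmup
# steps that decays to 10% of the peak learning rate.
import math
import torch

model = torch.nn.Linear(512, 512)                           # stand-in for the transformer
peak_lr, total_steps, warmup_steps = 1.5e-4, 100_000, 2_000  # placeholders

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay from 1.0 down to 0.1 of the peak rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One illustrative optimization step with a dummy loss.
loss = model(torch.randn(8, 512)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
scheduler.step()
```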
Efficient implementation
- Causal multi-head attention: An efficient implementation from the xformers library reduces memory usage and runtime.
- Activation checkpointing: Saves activations that are expensive to compute (such as the outputs of linear layers) so that fewer activations need to be recomputed during the backward pass; a sketch of both techniques follows this list.
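The sketch below illustrates both techniques, assuming the xformers library for memory-efficient causal attention and `torch.utils.checkpoint` for activation recomputation; the tensor shapes, the toy feed-forward sub-layer, and the CUDA device are illustrative assumptions.

```python
# Illustrative sketch: memory-efficient causal attention (xformers) plus
# activation checkpointing of a sub-layer (torch.utils.checkpoint).
import torch
import torch.utils.checkpoint as checkpoint
import xformers.ops as xops

B, T, H, D = 2, 128, 8, 64                       # batch, seq, heads, head_dim
q = torch.randn(B, T, H, D, device="cuda", requires_grad=True)
k = torch.randn(B, T, H, D, device="cuda", requires_grad=True)
v = torch.randn(B, T, H, D, device="cuda", requires_grad=True)

# Causal attention without materializing the full T x T attention matrix.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# Checkpointed sub-layer: its intermediate activations are not stored for
# the backward pass; they are recomputed when gradients are needed.
ffn = torch.nn.Sequential(
    torch.nn.Linear(H * D, 4 * H * D),
    torch.nn.SiLU(),
    torch.nn.Linear(4 * H * D, H * D),
).cuda()
x = out.reshape(B, T, H * D)
y = checkpoint.checkpoint(ffn, x, use_reentrant=False)
y.sum().backward()   # gradients flow through both techniques
```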
Main Results
Evaluated on 20 benchmarks covering zero-shot and few-shot tasks, and compared to non-public models (GPT-3, Gopher, Chinchilla, PaLM) and open-source models (OPT, GPT-J, GPT-Neo).
Common sense reasoning
- Benchmarks: Eight standard benchmarks: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC (easy and challenge), and OpenBookQA. These datasets include Cloze- and Winograd-style tasks and multiple-choice question answering (QA); a likelihood-scoring sketch follows the results below.
- Results
- LLaMA-65B outperforms Chinchilla 70B and PaLM-540B on most benchmarks except BoolQ.
- LLaMA-13B outperforms GPT-3 on most benchmarks despite being significantly smaller.
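Multiple-choice benchmarks of this kind are typically scored by comparing the model's (length-normalized) log-likelihood of each candidate completion; the sketch below illustrates that idea using a small Hugging Face model as a stand-in, with a per-character normalization that is an assumption rather than the paper's exact recipe.

```python
# Illustrative likelihood-based multiple-choice scoring with a stand-in LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(context: str, completion: str) -> float:
    """Per-character log-likelihood of `completion` given `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # predict tokens 1..N-1
    targets = full_ids[0, 1:]
    idx = torch.arange(ctx_len - 1, targets.shape[0])       # completion positions
    token_lp = logprobs[idx, targets[idx]]
    return token_lp.sum().item() / max(1, len(completion))

question = "The capital of France is"
choices = [" Paris.", " Berlin.", " Rome."]
print(max(choices, key=lambda c: completion_logprob(question, c)))
```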
Closed-book question answering
- Benchmarks: Natural Questions and TriviaQA. Exact match performance is reported in the closed-book setting, where the models do not have access to documents containing evidence to answer the question; a sketch of the metric follows the results below.
- Results:
- LLaMA-65B achieves state-of-the-art (SOTA) performance in both zero-shot and few-shot settings.
- LLaMA-13B is competitive with GPT-3 and Chinchilla, which are much larger models.
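For reference, a minimal exact-match metric of the kind used for closed-book QA could look like the sketch below; the normalization rules (lowercasing, stripping punctuation and English articles) are common conventions, not necessarily the paper's exact evaluation script.

```python
# Illustrative exact-match (EM) metric with light answer normalization.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)    # drop English articles
    return " ".join(text.split())                  # collapse whitespace

def exact_match(prediction: str, references: list[str]) -> bool:
    return any(normalize(prediction) == normalize(ref) for ref in references)

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))   # True
```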
Reading comprehension
- Benchmark: RACE, a reading comprehension dataset collected from English exams designed for Chinese middle and high school students.
- Results: LLaMA-65B is competitive with PaLM-540B, and LLaMA-13B outperforms GPT-3.
Mathematical reasoning
- Benchmarks: MATH and GSM8k. MATH contains 12K middle and high school math problems; GSM8k is a set of middle school math problems.
- Results: LLaMA-65B outperforms Minerva-62B on GSM8k.
- Minerva is a series of PaLM models fine-tuned on 38.5B tokens extracted from ArXiv and math web pages; neither PaLM nor LLaMA is fine-tuned on math data.
Code generation
- Benchmarks: HumanEval and MBPP. The models are evaluated on their ability to write code from a natural language description.
- Results:
- LLaMA models outperform other general-purpose models, including LaMDA and PaLM: LLaMA-13B outperforms LaMDA-137B, and LLaMA-65B outperforms PaLM-62B.
- Fine-tuning on code-specific tokens further improves performance.
Massive multitask language understanding
- Massive multitask language understanding (MMLU) consists of multiple choice questions covering various domains of knowledge, like humanities, STEM and social sciences.
- Results: LLaMA-65B underperforms compared to Chinchilla-70B and PaLM-540B, possibly due to limited academic data.
Evolution of performance during training
- Performance improves steadily, and correlates with the training perplexity of the model.
- SIQA and WinoGrande are exceptions: performance on SIQA varies considerably, suggesting the benchmark may not be reliable, and performance on WinoGrande does not correlate well with training perplexity.
Instruction Fine-tuning
Fine-tuning on instruction data improves performance and, further, the ability to follow instructions. LLaMA-I (LLaMA fine-tuned with instructions) is evaluated on MMLU and compared with OPT-IML and the Flan-PaLM series, instruction fine-tuned models of moderate size.
- LLaMA-I with 65B parameters outperforms these existing instruction fine-tuned models, but remains behind GPT ‘code-davinci-002’.
Bias, Toxicity and Misinformation
LLMs have been shown to reproduce biases present in their training data and to generate toxic content. LLaMA is evaluated using benchmarks for toxic content generation and stereotype detection.
RealToxicityPrompts
Measures how toxic a model’s generations are. The toxicity score is obtained automatically by sending each completion to the PerspectiveAPI, which returns a score ranging from 0 (non-toxic) to 1 (toxic); a request sketch follows the results below.
- Toxicity is comparable to other models, with larger models exhibiting more toxicity, especially for “Respectful” prompts.
- This suggests that the relation between toxicity and model size may only hold within a model family.
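A hedged sketch of scoring a single completion with the Perspective API is shown below; the endpoint and response fields follow the public API documentation as far as I know, and the API key and text are placeholders.

```python
# Illustrative Perspective API request returning a TOXICITY score in [0, 1].
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"   # placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

body = {
    "comment": {"text": "model completion to be scored"},   # placeholder text
    "requestedAttributes": {"TOXICITY": {}},
}
response = requests.post(URL, json=body).json()
score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"toxicity: {score:.3f}")        # 0 (non-toxic) to 1 (toxic)
```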
CrowS-Pairs
Evaluates the biases of a model across 9 categories: gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status.
- LLaMA shows slight biases, particularly in the religion, age, and gender categories; these may come from the CommonCrawl data.
WinoGender
Used to investigate gender bias in a model. It evaluates whether the model’s co-reference resolution performance is affected by the gender of the pronoun.
- Performance varies by pronoun type: the models perform better on “their/them/someone” pronouns than on “her/her/she” and “his/him/he” pronouns.
- Larger models show more gender bias: on “gotcha” cases, LLaMA-65B makes more errors, showing that it captures gender biases.
- “Gotcha” cases are those in which the pronoun does not match the majority gender of the occupation, and the occupation is the correct answer.
TruthfulQA
Evaluates a model’s ability to identify true claims and measures the risk of generating misinformation or false claims, assessing the truthfulness of a model’s responses.
- LLaMA models show better truthfulness than GPT-3. However, the rate of correct answers remains low, indicating a potential for misinformation.
Carbon footprint
Details the environmental impact of training and deploying these models.
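The kind of estimate involved can be sketched as follows: energy is derived from GPU-hours, per-GPU power draw, and data-center PUE, then converted to CO2-equivalent with a grid carbon-intensity factor. The default PUE and carbon-intensity values below are common reference figures used here as assumptions, not numbers quoted from these notes.

```python
# Illustrative carbon-footprint estimate from GPU-hours and power draw.
def co2_tonnes(gpu_hours: float, gpu_power_watts: float,
               pue: float = 1.1, kgco2_per_kwh: float = 0.385) -> float:
    energy_kwh = gpu_hours * gpu_power_watts / 1000.0 * pue   # facility energy
    return energy_kwh * kgco2_per_kwh / 1000.0                # kg -> tonnes

# Hypothetical example: 1M GPU-hours at 400 W per GPU.
print(f"{co2_tonnes(1_000_000, 400):.0f} tCO2eq")
```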