[Paper] LLaMA: Open and Efficient Foundation Language Models (2023)
Touvron, Hugo, et al. “Llama: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).
Points
- Efficient inference with smaller models: LLaMA models prioritize inference efficiency by using smaller models trained on large datasets, achieving state-of-the-art (SOTA) performance across benchmarks while being cost-effective during inference.
- Publicly available data: Unlike many existing models that rely on proprietary data, LLaMA models are trained exclusively on publicly available datasets, ensuring transparency and compatibility with open-source principles.
- Broad benchmark performance: LLaMA models demonstrate competitive performance on a wide range of tasks, including common sense reasoning, question answering, and reading comprehension.
Background
Large language models (LLMs) have demonstrated remarkable capabilities in performing new tasks with minimal instruction or examples, thanks to their vast size. However, recent research suggests that smaller models trained on larger datasets can achieve superior performance, highlighting the importance of efficiency during inference rather than training.
Approach
LLaMA is a series of language models (LMs) designed to optimize performance across various inference budgets, ranging from 7B to 65B parameters, using only publicly available data.
Pre-training data
The dataset mixture covers diverse domains and is entirely publicly available, ensuring compatibility with open-source principles:
- English CommonCrawl [67%]: Preprocessed from five CommonCrawl dumps (2017-2020), filtered for non-English and low-quality content.
- C4 [15%]: Preprocessed similarly to CommonCrawl and included to enhance performance.
- GitHub [4.5%]: Public repositories from Google BigQuery, filtered by line length and alphanumeric content.
- Wikipedia [4.5%]: Dumps from mid-2022, covering multiple languages.
- Gutenberg and Books3 [4.5%]: Publicly available books with redundant content removed.
- ArXiv [2.5%]: Includes scientific data, with non-essential content removed.
- Stack Exchange [2%]: High-quality Q&A content sorted by score.
Tokenization
- Byte Pair Encoding (BPE) tokenizer used.
- Splits numbers into individual digits and falls back to bytes to decompose unknown UTF-8 characters.
- The training dataset contains approximately 1.4T tokens after tokenization, with minimal repetition (fig 1); a tokenizer training sketch follows this list.
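To make these choices concrete, the sketch below trains such a tokenizer with the SentencePiece library, which provides BPE with `split_digits` and `byte_fallback` options; the corpus path and the 32k vocabulary size are illustrative assumptions rather than details taken from these notes.

```python
# Illustrative BPE tokenizer training with digit splitting and byte fallback.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical plain-text training corpus
    model_prefix="llama_bpe",  # writes llama_bpe.model / llama_bpe.vocab
    model_type="bpe",
    vocab_size=32000,          # assumed vocabulary size
    split_digits=True,         # split numbers into individual digits
    byte_fallback=True,        # decompose unknown UTF-8 characters into bytes
)

sp = spm.SentencePieceProcessor(model_file="llama_bpe.model")
print(sp.encode("Trained on 1.4T tokens in 2023.", out_type=str))
```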
Architecture
LLaMA models are based on the transformer architecture, with three key modifications (sketched in code after the list):
- Pre-normalization [GPT3]: Normalizes the input of each transformer sub-layer with RMSNorm, improving training stability.
- SwiGLU activation function [PaLM]: Replaces ReLU with SwiGLU to improve performance, using a hidden dimension of $\frac{2}{3}4d$ instead of the $4d$ used in PaLM.
- Rotary Embeddings [GPTNeo]: Employs Rotary embeddings (RoPE) instead of absolute positional embeddings at each layer of the network.
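A minimal PyTorch sketch of these three modifications is given below; the module names, shapes, and the rotate-half RoPE formulation are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (not the released code): RMSNorm pre-normalization,
# a SwiGLU feed-forward block with hidden size ~ (2/3) * 4d, and rotary
# position embeddings applied to query/key vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root mean square instead of mean and variance.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 / 3 * 4 * dim)  # 2/3 * 4d instead of 4d
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(x W_gate) * (x W_up), projected back to dim.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Apply RoPE to a tensor of shape (..., seq_len, head_dim) by rotating
    # channel pairs with position-dependent angles (rotate-half form).
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In a pre-normalized block, RMSNorm is applied to the input of the attention and feed-forward sub-layers, with the residual connection added afterwards, and `rotary_embedding` is applied to the query and key vectors rather than adding absolute position embeddings to the token embeddings.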
Optimizer
Trained using the AdamW optimizer with:
- $\beta_1=0.9, \beta_2=0.95$.
- Cosine learning rate schedule, ending at 10% of the maximal rate.
- Weight decay of 0.1 and gradient clipping of 1.0.
- 2,000 warmup steps, with the learning rate and batch size varying with model size (table 2); an optimizer configuration sketch follows this list.
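As a rough illustration, the sketch below wires this configuration up in PyTorch; the peak learning rate, total step count, and stand-in model are placeholders rather than values from table 2.

```python
# Illustrative optimizer setup: AdamW (beta1=0.9, beta2=0.95, weight decay
# 0.1), gradient clipping at 1.0, and a cosine schedule with 2,000 warmup
# steps that decays to 10% of the peak learning rate.
import math
import torch

model = torch.nn.Linear(512, 512)                           # stand-in for the transformer
peak_lr, total_steps, warmup_steps = 1.5e-4, 100_000, 2_000  # placeholders

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay from 1.0 down to 0.1 of the peak rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One illustrative optimization step with a dummy loss.
loss = model(torch.randn(8, 512)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
scheduler.step()
```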
Efficient implementation
- Causal multi-head attention: An efficient implementation from the xformers library reduces memory usage and runtime.
- Activation checkpointing: Saves activations that are expensive to compute (such as the outputs of linear layers) so that fewer activations need to be recomputed during the backward pass; a sketch of both techniques follows this list.
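The sketch below illustrates both techniques, assuming the xformers library for memory-efficient causal attention and `torch.utils.checkpoint` for activation recomputation; the tensor shapes, the toy feed-forward sub-layer, and the CUDA device are illustrative assumptions.

```python
# Illustrative sketch: memory-efficient causal attention (xformers) plus
# activation checkpointing of a sub-layer (torch.utils.checkpoint).
import torch
import torch.utils.checkpoint as checkpoint
import xformers.ops as xops

B, T, H, D = 2, 128, 8, 64                       # batch, seq, heads, head_dim
q = torch.randn(B, T, H, D, device="cuda", requires_grad=True)
k = torch.randn(B, T, H, D, device="cuda", requires_grad=True)
v = torch.randn(B, T, H, D, device="cuda", requires_grad=True)

# Causal attention without materializing the full T x T attention matrix.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# Checkpointed sub-layer: its intermediate activations are not stored for
# the backward pass; they are recomputed when gradients are needed.
ffn = torch.nn.Sequential(
    torch.nn.Linear(H * D, 4 * H * D),
    torch.nn.SiLU(),
    torch.nn.Linear(4 * H * D, H * D),
).cuda()
x = out.reshape(B, T, H * D)
y = checkpoint.checkpoint(ffn, x, use_reentrant=False)
y.sum().backward()   # gradients flow through both techniques
```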
Main Results
Evaluated on 20 benchmarks covering zero-shot and few-shot tasks, and compared to non-public models (GPT-3, Gopher, Chinchilla, PaLM) and open-source models (OPT, GPT-J, GPT-Neo).
Common sense reasoning
- Benchmarks: Eight standard benchmarks: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC (easy and challenge), and OpenBookQA. These datasets include Cloze- and Winograd-style tasks and multiple-choice question answering (QA); a likelihood-scoring sketch follows the results below.
- Results
- LLaMA-65B outperforms Chinchilla 70B and PaLM-540B on most benchmarks except BoolQ.
- LLaMA-13B outperforms GPT-3 on most benchmarks despite being significantly smaller.
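Multiple-choice benchmarks of this kind are typically scored by comparing the model's (length-normalized) log-likelihood of each candidate completion; the sketch below illustrates that idea using a small Hugging Face model as a stand-in, with a per-character normalization that is an assumption rather than the paper's exact recipe.

```python
# Illustrative likelihood-based multiple-choice scoring with a stand-in LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(context: str, completion: str) -> float:
    """Per-character log-likelihood of `completion` given `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # predict tokens 1..N-1
    targets = full_ids[0, 1:]
    idx = torch.arange(ctx_len - 1, targets.shape[0])       # completion positions
    token_lp = logprobs[idx, targets[idx]]
    return token_lp.sum().item() / max(1, len(completion))

question = "The capital of France is"
choices = [" Paris.", " Berlin.", " Rome."]
print(max(choices, key=lambda c: completion_logprob(question, c)))
```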
Closed-book question answering
- Benchmarks: Natural Questions and TriviaQA. Exact match performance is reported in the closed-book setting, where the models do not have access to documents containing evidence to answer the question; a sketch of the metric follows the results below.
- Results:
- LLaMA-65B achieves state-of-the-art (SOTA) performance in both zero-shot and few-shot settings.
- LLaMA-13B is competitive with GPT-3 and Chinchilla, which are much larger models.
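For reference, a minimal exact-match metric of the kind used for closed-book QA could look like the sketch below; the normalization rules (lowercasing, stripping punctuation and English articles) are common conventions, not necessarily the paper's exact evaluation script.

```python
# Illustrative exact-match (EM) metric with light answer normalization.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)    # drop English articles
    return " ".join(text.split())                  # collapse whitespace

def exact_match(prediction: str, references: list[str]) -> bool:
    return any(normalize(prediction) == normalize(ref) for ref in references)

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))   # True
```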
Reading comprehension
- Benchmark: RACE, a reading comprehension dataset collected from English exams designed for Chinese middle and high school students.
- Results: LLaMA-65B is competitive with PaLM-540B, and LLaMA-13B outperforms GPT-3.
Mathematical reasoning
- Benchmarks: MATH and GSM8k. MATH contains 12K middle and high school math problems; GSM8k is a set of middle school math problems.
- Results: LLaMA-65B outperforms Minerva-62B on GSM8k.
- Minerva is a series of PaLM models fine-tuned on 38.5B tokens extracted from ArXiv and math web pages; neither PaLM nor LLaMA is fine-tuned on math data.
Code generation
- Benchmarks: HumanEval and MBPP. The models are evaluated on their ability to write code from a natural language description.
- Results:
- LLaMA models outperform other general-purpose models, including LaMDA and PaLM: LLaMA-13B outperforms LaMDA-137B, and LLaMA-65B outperforms PaLM-62B.
- Fine-tuning on code-specific tokens further improves performance.
Massive multitask language understanding
- Massive multitask language understanding (MMLU) consists of multiple choice questions covering various domains of knowledge, like humanities, STEM and social sciences.
- Results: LLaMA-65B underperforms compared to Chinchilla-70B and PaLM-540B, possibly due to limited academic data.
Evolution of performance during training
- Performance improves steadily, and correlates with the training perplexity of the model.
- SIQA and WinoGrande are exceptions: performance on SIQA varies considerably, suggesting the benchmark may not be reliable, and performance on WinoGrande does not correlate well with training perplexity.
Instruction Fine-tuning
Fine-tuning on instruction data improves performance and, further, the ability to follow instructions. LLaMA-I (LLaMA fine-tuned with instructions) is evaluated on MMLU and compared with OPT-IML and the Flan-PaLM series, instruction fine-tuned models of moderate size.
- LLaMA-I with 65B parameters outperforms these existing instruction fine-tuned models, but remains behind GPT ‘code-davinci-002’.
Bias, Toxicity and Misinformation
LLMs have been shown to reproduce biases present in their training data and to generate toxic content. LLaMA is evaluated using benchmarks for toxic content generation and stereotype detection.
RealToxicityPrompts
Measures how toxic a model’s generations are. The toxicity score is obtained automatically by sending each completion to the PerspectiveAPI, which returns a score ranging from 0 (non-toxic) to 1 (toxic); a request sketch follows the results below.
- Toxicity is comparable to other models, with larger models exhibiting more toxicity, especially for “Respectful” prompts.
- This suggests that the relation between toxicity and model size may only hold within a model family.
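A hedged sketch of scoring a single completion with the Perspective API is shown below; the endpoint and response fields follow the public API documentation as far as I know, and the API key and text are placeholders.

```python
# Illustrative Perspective API request returning a TOXICITY score in [0, 1].
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"   # placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

body = {
    "comment": {"text": "model completion to be scored"},   # placeholder text
    "requestedAttributes": {"TOXICITY": {}},
}
response = requests.post(URL, json=body).json()
score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"toxicity: {score:.3f}")        # 0 (non-toxic) to 1 (toxic)
```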
CrowS-Pairs
Evaluates the biases of a model across 9 categories: gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status.
- LLaMA shows slight biases, particularly in the religion, age, and gender categories; these may come from the CommonCrawl data.
WinoGender
Used to investigate gender bias in a model. It evaluates whether the model’s co-reference resolution performance is affected by the gender of the pronoun.
- Performance varies by pronoun type: the models perform better on “their/them/someone” pronouns than on “her/her/she” and “his/him/he” pronouns.
- Larger models show more gender bias: on “gotcha” cases, LLaMA-65B makes more errors, showing that it captures gender biases.
- “Gotcha” cases are those in which the pronoun does not match the majority gender of the occupation, and the occupation is the correct answer.
TruthfulQA
Evaluates a model’s ability to identify true claims and measures the risk of generating misinformation or false claims, assessing the truthfulness of a model’s responses.
- LLaMA models show better truthfulness than GPT-3. However, the rate of correct answers remains low, indicating a potential for misinformation.
Carbon footprint
Details the environmental impact of training and deploying these models.
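The kind of estimate involved can be sketched as follows: energy is derived from GPU-hours, per-GPU power draw, and data-center PUE, then converted to CO2-equivalent with a grid carbon-intensity factor. The default PUE and carbon-intensity values below are common reference figures used here as assumptions, not numbers quoted from these notes.

```python
# Illustrative carbon-footprint estimate from GPU-hours and power draw.
def co2_tonnes(gpu_hours: float, gpu_power_watts: float,
               pue: float = 1.1, kgco2_per_kwh: float = 0.385) -> float:
    energy_kwh = gpu_hours * gpu_power_watts / 1000.0 * pue   # facility energy
    return energy_kwh * kgco2_per_kwh / 1000.0                # kg -> tonnes

# Hypothetical example: 1M GPU-hours at 400 W per GPU.
print(f"{co2_tonnes(1_000_000, 400):.0f} tCO2eq")
```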