LLM Tech Report Notes (updated on 01/22/2025)
- DeepSeek-R1 (01/20/2025)
- Kimi K1.5 (01/20/2025)
- MiniMax-01 (01/15/2025)
- Qwen2.5-Math-PRM (01/13/2025)
- OLMo 2 (12/31/2024)
- DeepSeek-V3 (12/26/2024)
- Qwen2.5 (12/19/2024)
- Phi-4 (12/12/2024)
- TÜLU 3 (12/06/2024)
- Llama 3 (08/15/2024)
- OLMo (07/07/2024)
DeepSeek-R1 (01/20/2025)
- DeepSeek-R1-Zero: use DeepSeek-V3-Base as the base model and employ GRPO as the RL framework to improve reasoning performance, with no supervised fine-tuning as a preliminary step.
- DeepSeek-R1:
- (DeepSeek-V3-Base)->(DeepSeek-V3-SFT1) cold-start SFT on thousands of samples collected via in-context long-CoT prompting + readable DeepSeek-R1-Zero outputs
- (DeepSeek-V3-SFT1)->(DeepSeek-V3-RL) reasoning-oriented RL like DeepSeek-R1-Zero.
- (DeepSeek-V3-Base)->(DeepSeek-V3-SFT2) two epochs of fine-tuning DeepSeek-V3-Base on 600k reasoning-related training samples obtained via rejection sampling on the RL checkpoint + 200k non-reasoning training samples
- (DeepSeek-V3-SFT2)->(DeepSeek-R1) after fine-tuning, an additional RL stage that takes prompts from all scenarios into account
- Do not use an ORM or PRM; use a rule-based reward system: accuracy rewards and format rewards (see the sketch at the end of this section)
- Emphasize that a neural reward model may suffer from reward hacking in the large-scale reinforcement learning process
- Design a straightforward training template that guides the base model to adhere to specified instructions (reasoning inside <think></think>, final answer inside <answer></answer>)
- (Interesting) DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This improvement is not the result of external adjustments but rather an intrinsic development within the model.
- (Interesting) Behaviors such as reflection are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment.
- Aha Moment of DeepSeek-R1-Zero: DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach.
- DeepSeek-R1-Zero struggles with challenges like poor readability and language mixing.
- Distillation from DeepSeek-R1 to smaller dense models works well. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities
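A minimal sketch of what a rule-based reward in this style could look like; the tag format follows the template note above, but the matching logic and reward values are illustrative assumptions, not DeepSeek's implementation:

```python
import re

# Assumed template: reasoning inside <think>...</think>, final answer inside <answer>...</answer>.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Format reward + accuracy reward; the 0.5/1.0 values are placeholders."""
    reward = 0.0
    match = FORMAT_PATTERN.match(response.strip())
    if match:
        reward += 0.5  # format reward: output follows the required template
        predicted = match.group(1).strip()
        if predicted == ground_truth.strip():
            reward += 1.0  # accuracy reward: rule-based check (exact match here; unit tests for code)
    return reward
```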
Kimi K1.5 (01/20/2025)
- Long-CoT Supervised Fine-Tuning
- construct a small yet high-quality long-CoT warmup dataset
- Reinforcement Learning
- For verifiable problems, the reward comes from predefined criteria or rules. For problems with free-form ground truth, use a reward model r(x, y, y∗).
- Length penalty to avoid the overthinking phenomenon (sketch below). Several approaches to this long2short problem: model merging, shortest rejection sampling, DPO, and long2short RL.
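A rough sketch of a length penalty in this spirit: interpolate between the shortest and longest sampled responses for a prompt, reward short correct answers, and only penalize (never reward) long incorrect ones. The linear form and constants are my reading/assumptions, not necessarily Kimi's exact formula:

```python
def length_reward(length: int, min_len: int, max_len: int, is_correct: bool) -> float:
    """Illustrative length penalty added on top of the main (correctness) reward."""
    if max_len == min_len:
        return 0.0
    lam = 0.5 - (length - min_len) / (max_len - min_len)  # +0.5 for the shortest, -0.5 for the longest
    return lam if is_correct else min(0.0, lam)           # never reward a wrong answer for being short
```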
MiniMax-01 (01/15/2025)
- 456 billion total parameters, 45.9 billion activated parameters per token, and 32 experts; 1.5T tokens for pre-training
- good to know that the naive linear attention \(O = Norm(Q(K^{\top}V))\) has efficiency issues due to the cumulative sum operation when considering the causal mask (see the sketch after this list)
- Need to learn the details of Lightning Attention: https://sustcsonglin.github.io/assets/pdf/talk_250117.pdf
- Transformer-style blocks, each comprising a channel mixer (an attention block: lightning attention or softmax attention) and a feature mixer (an MLP block: an MoE incorporating multiple feed-forward networks (FFNs))
- The hybrid architecture has yielded promising results; they delve deeper into its potential through two variants: hybrid-cosformer2 and hybrid-hgrn2.
- Almost perfect long-context understanding ability, with a context window of 1M tokens
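To make the causal-mask issue concrete, here is a minimal PyTorch sketch (normalization omitted) contrasting the non-causal parallel form \(O = Q(K^{\top}V)\) with the causal form, which needs a running cumulative sum of key-value outer products; efficient kernels such as lightning attention tile this scan rather than looping token by token:

```python
import torch

def linear_attention_non_causal(Q, K, V):
    """Q, K: (n, d); V: (n, d_v). One shared (d x d_v) matmul for all positions -> O(n * d * d_v)."""
    return Q @ (K.transpose(-2, -1) @ V)

def linear_attention_causal(Q, K, V):
    """With a causal mask, position t may only attend to s <= t, so we need the
    running sum S_t = sum_{s<=t} k_s v_s^T -- the cumulative-sum bottleneck."""
    n, d_v = Q.shape[0], V.shape[1]
    S = torch.zeros(Q.shape[1], d_v, dtype=Q.dtype)
    out = torch.zeros(n, d_v, dtype=Q.dtype)
    for t in range(n):                                  # sequential scan over tokens
        S = S + K[t].unsqueeze(1) @ V[t].unsqueeze(0)   # (d, d_v) outer product
        out[t] = Q[t] @ S
    return out
```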
Qwen2.5-Math-PRM (01/13/2025)
- Commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods (see the MC-estimation sketch below).
- Reveal the potential bias in using response-level BoN evaluation alone for PRMs
- TBD
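For context, the MC-estimation data synthesis being criticized roughly looks like the sketch below: label each reasoning step by how often completions sampled from that step reach the correct final answer. The helper functions and the 8-rollout budget are placeholders, not Qwen's setup:

```python
def mc_step_label(prefix_steps, sample_completions, is_correct, n_rollouts=8):
    """Monte Carlo estimate of a step's quality.

    prefix_steps: the solution steps up to and including the step being labeled.
    sample_completions(prefix, n): draws n completions from a policy model (placeholder).
    is_correct(completion): compares the final answer to the ground truth (placeholder).
    """
    completions = sample_completions(prefix_steps, n_rollouts)
    success_rate = sum(is_correct(c) for c in completions) / n_rollouts
    # A common hard-labeling rule: the step is "good" if any rollout reaches the correct answer.
    return {"soft_label": success_rate, "hard_label": success_rate > 0.0}
```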
OLMo 2 (12/31/2024)
- up to 5T tokens, 95% derived from web data; 7B and 13B parameters
- Reordered norm and QK-norm: \(h := x + \text{RMSNorm}(\text{Attention}(x)); \ h_{out} := h + \text{RMSNorm}(\text{MLP}(h))\) (see the sketch at the end of this section)
- Data can be a cause of both gradient norm and loss spikes. When investigating training batches at which spikes occurred, we found a high prevalence of instances containing long, repeated n-gram sequences
- To improve training stability, OLMo 2 initializes every parameter from a normal distribution with a mean of \(0\) and a standard deviation of \(0.02\)
- decreasing the AdamW \(\epsilon\) from \(10^{−5}\) to \(10^{−8}\)
- confirm the effectiveness of checkpoint averaging, also known as model souping, on six different mid-training mixes
- three phases of training: SFT, preference tuning with DPO, and RLVR
- turn off weight decay for embeddings and observe that embedding norms settle in a healthy region.
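A minimal PyTorch sketch of the reordered-norm block with QK-norm from the note above (single-head attention, nn.RMSNorm from recent PyTorch; my reading of the formulas, not OLMo 2's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReorderedNormBlock(nn.Module):
    """h = x + RMSNorm(Attention(x)); h_out = h + RMSNorm(MLP(h)).
    Norms are applied to the sublayer *outputs* instead of the inputs,
    and projected queries/keys are RMS-normalized (QK-norm)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)
        self.q_norm, self.k_norm = nn.RMSNorm(d_model), nn.RMSNorm(d_model)  # QK-norm
        self.attn_norm = nn.RMSNorm(d_model)  # norm on the attention output
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.mlp_norm = nn.RMSNorm(d_model)   # norm on the MLP output

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        q, k, v = self.q_norm(self.wq(x)), self.k_norm(self.wk(x)), self.wv(x)
        attn = self.wo(F.scaled_dot_product_attention(q, k, v, is_causal=True))
        h = x + self.attn_norm(attn)           # reordered: norm sits outside the residual branch
        return h + self.mlp_norm(self.mlp(h))
```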
DeepSeek-V3 (12/26/2024)
- multi-token prediction objective, the acceptance rate of 2nd token prediction is 85% ~ 90%
- knowledge distillation from DeepSeek-R1, notably improves its reasoning performance
- balanced expert loading: introduce a bias term for each expert to help determine the top-K routing (see the first sketch at the end of this section)
- DualPipe: overlap the computation and communication within forward and backward chunks.
- fp8 quantization during training: introduce a fine-grained quantization strategy for fp8 (see the second sketch at the end of this section)
- an efficient and lightweight training framework, HAI-LLM (might be the impressive engineering basis)
- numbers: 14.8T tokens for pre-training
- RMSNorm recomputation during back-propagation
- adopt BF16 for the first and second moments in AdamW
- do not incorporate cross-sample attention masking during training
- use document packing method for data integrity
- incorporate the FIM strategy in the pre-training
- shared embedding and output head for multi-token prediction (due to the DualPipe implementation)
- do not use costly tensor parallelism
- suggestions on hardware design
- higher FP8 GEMM accumulation precision
- tile- and block-wise quantization
- online quantization
- transposed GEMM operations
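A small sketch of the bias-based load balancing noted above: the bias only influences which experts are selected for top-K, the gating weights still come from the original scores, and the bias is nudged down for overloaded experts and up for underloaded ones. The update rule and step size are simplified assumptions:

```python
import torch

def biased_topk_routing(scores: torch.Tensor, bias: torch.Tensor, k: int):
    """scores: (n_tokens, n_experts) non-negative affinities (e.g., sigmoid outputs);
    bias: (n_experts,) per-expert routing bias."""
    topk_idx = torch.topk(scores + bias, k, dim=-1).indices        # bias affects selection only
    gate = torch.zeros_like(scores).scatter(-1, topk_idx, scores.gather(-1, topk_idx))
    gate = gate / gate.sum(dim=-1, keepdim=True).clamp_min(1e-9)   # weights from original scores
    return topk_idx, gate

def update_bias(bias: torch.Tensor, topk_idx: torch.Tensor, k: int, gamma: float = 1e-3):
    """Simplified balancing step: lower the bias of overloaded experts, raise underloaded ones."""
    n_tokens, n_experts = topk_idx.shape[0], bias.shape[0]
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    target = n_tokens * k / n_experts
    return bias - gamma * torch.sign(load - target)
```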
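And a sketch of the fine-grained quantization idea: rather than one scale per tensor, each small tile or block gets its own scale, so a single outlier only degrades its block. As I recall, the report uses 1x128 tiles for activations and 128x128 blocks for weights; the helper below only shows the weight-block case and is a simplified illustration:

```python
import torch

FP8_MAX = 448.0  # largest representable value of float8_e4m3fn

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    """Per-block FP8 quantization of a 2-D weight (dims assumed divisible by `block`)."""
    rows, cols = w.shape
    scales = torch.empty(rows // block, cols // block)
    q = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = tile.abs().max().clamp_min(1e-12) / FP8_MAX    # per-block scale
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = (tile / s).to(torch.float8_e4m3fn)
    return q, scales  # dequantize a block as q_block.float() * scale
```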
Qwen2.5 (12/19/2024)
- 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B; 18T tokens for pre-training
- Qwen2-72B-Instruct and Qwen2-Math-72B-Instruct generate synthetic data in mathematics, code, and knowledge domains
- increase the RoPE base from 10,000 to 1,000,000 using the ABF technique (sketch below)
- develop long-response datasets, enabling the model to generate high-quality responses of up to 8,192 tokens
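A quick sketch of what raising the RoPE base (the ABF idea) changes: only the inverse frequencies, so dimensions rotate more slowly and distant positions stay distinguishable at long context. Illustration only, not Qwen's code:

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies: theta_i = base^(-2i / head_dim)."""
    return base ** (-torch.arange(0, head_dim, 2).float() / head_dim)

inv_freq_short = rope_inv_freq(128, base=10_000.0)     # original base
inv_freq_long = rope_inv_freq(128, base=1_000_000.0)   # ABF: larger base -> slower rotation
```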
Phi-4 (12/12/2024)
- numbers: 14B, 10T tokens
- 50 broad types of synthetic datasets totaling about 400B tokens, spanning an array of topics, skills, and natures of interaction
- question-answer data contributed significantly to various capabilities, such as mathematical reasoning and academic performance
- one round of SFT, one round of DPO on data from our pivotal token search method, and one round of DPO on full length preference pairs
- 8B tokens of data for SFT, all formatted in the ChatML format
TÜLU 3 (12/06/2024)
- synthetic data generation for target skills such as precise instruction following, math and coding
- safety SFT data was generally orthogonal to our other datasets
- changing the chat template, replacing the newlines at the end of assistant messages with an EOS token
- SFT performance noticeably varies based on the seed
- model soup does not always outperform the best single run
- use length-normalized DPO for tuning our preference data mixtures and generation methods (see the sketch at the end of this section)
- scaling the number of unique prompts improves downstream DPO performance
- for our final DPO models we decided on using a learning rate of \(2.0 × 10^{-7}\)
- introduce Reinforcement Learning with Verifiable Rewards (RLVR), a novel method for training LLMs on tasks with verifiable outcomes
- RLVR focuses on two domains (mathematics, exact instruction following) and three evaluations (GSM8K, MATH, IFEval)
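A sketch of length-normalized DPO as I understand it: identical to standard DPO except each sequence log-likelihood is divided by its token length before forming the margin (β and any further details are assumptions to check against the paper):

```python
import torch.nn.functional as F

def length_normalized_dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
                               len_w, len_l, beta: float = 0.1):
    """pi_logp_* / ref_logp_*: summed log-probs of chosen (w) / rejected (l) responses under
    the policy and reference models; len_*: response lengths in tokens; beta is a hyperparameter."""
    policy_margin = pi_logp_w / len_w - pi_logp_l / len_l   # length-normalized log-likelihoods
    ref_margin = ref_logp_w / len_w - ref_logp_l / len_l
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```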
Llama 3 (08/15/2024)
- 405B parameters on 15.6T tokens using a context window of 8K tokens.
- extended the supported context window to 128K tokens
- supervised finetuning on instruction tuning data and Direct Preference Optimization
- annealing on small amounts of high-quality code and mathematical data can boost the performance of pre-trained models on key benchmarks
- Llama 3 405B is trained on up to 16K H100 GPUs
- use fully sharded data parallelism (FSDP) for training
- design a new multi-message chat protocol which uses various special header and termination tokens.
- average models obtained from experiments using various versions of data or hyperparameters at each RM, SFT, or DPO stage (sketch below)
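Checkpoint averaging as in the last bullet is just a parameter-wise mean; a minimal sketch assuming each path stores a plain state dict (Llama 3 may weight or select checkpoints differently):

```python
import torch

def average_checkpoints(paths):
    """Uniformly average model parameters across checkpoints (model soup)."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")  # assumed to be a plain state_dict
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}
```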
OLMo (07/07/2024)
- 1B and 7B models, 2T tokens from the Dolma dataset
- use up to 256 nodes on the LUMI cluster, where each node consists of 4x AMD MI250X GPUs with 128GB of memory and 800Gbps of interconnect
- release model weights, training data and training and evaluation code.