
LLM Tech Report

1. Ring-lite (06/18/2025)
2. Gemini 2.5 (06/17/2025)
3. AceReason-Nemotron 1.1 (06/18/2025)
4. MiniMax-M1 (06/16/2025)
5. Magistral (06/12/2025)
6. dots.llm1 (06/06/2025)
7. Qwen3 Embedding (06/05/2025)
8. OpenThoughts (06/05/2025)
9. MiMo-VL (06/04/2025)
10. Qwen 3 (05/14/2025)
11. Llama-Nemotron (05/14/2025)
12. MiMo (05/12/2025)
13. Seed 1.5-VL (05/11/2025)
14. Phi-4-reasoning (04/30/2025)
15. Seed 1.5-thinking (04/29/2025)
16. Kimi-Audio (04/25/2025)
17. Trillion 7B (04/21/2025)
18. Seedream 3.0 (04/16/2025)
19. Kimi-VL (04/15/2025)
20. Tulu 3 (04/14/2025)
21. Qwen2.5-Omni (03/26/2025)
22. Gemma 3 (03/25/2025)
23. Phi-4-Mini (03/07/2025)
24. Qwen2.5-VL (02/19/2025)
25. Janus-Pro (01/29/2025)
26. DeepSeek-R1 (01/20/2025)
27. Kimi K1.5 (01/20/2025)
28. MiniMax-01 (01/15/2025)
29. Qwen2.5-Math-PRM (01/13/2025)
30. OLMo 2 (12/31/2024)
31. DeepSeek-V3 (12/16/2024)
32. Qwen2.5 (12/19/2024)
33. Phi-4 (12/12/2024)
34. TÜLU 3 (12/06/2024)
35. Llama 3 (08/15/2024)
36. OLMo (07/07/2024)

1. Ring-lite (06/18/2025)

  • pdf, huggingface
  • Based on Ling-lite, a 16.8B parameter model with 2.75B activated parameters
  • Math (64.5%), Code (25.5%), and Science (9.2%), with the science data encompassing some high-quality and difficult samples generated by SHARP
  • C3PO: Constrained Contextual Computation Policy Optimization

2. Gemini 2.5 (06/17/2025)

  • pdf
  • sparse mixture-of-experts (MoE)
  • post-training: SFT, Reward Modeling (RM), RL stages
  • increased the training compute allocated to RL
  • coupled with a focus on verifiable rewards and model-based generative rewards

3. AceReason-Nemotron 1.1 (06/18/2025)

  • pdf, huggingface
  • Scaling of SFT data consistently improves performance
  • Performance improves progressively over epochs
  • Math-only RL significantly improves code reasoning
  • increasing the number of responses per prompt serves as a practical alternative for boosting the performance of the SFT model

4. MiniMax-M1 (06/16/2025)

  • pdf, huggingface

  • 456B parameters with 45.9B activated parameters

  • context length of 1 million tokens

  • CISPO (Clipped IS-weight Policy Optimization)

    • essentially a policy-gradient objective in which the importance-sampling (probability-ratio) weight is clipped, rather than the token update itself (a minimal sketch follows this list)
    • without weight clipping, CISPO reduces to the standard policy gradient objective
    • no effective lower bound is imposed on the IS weight (ε_low^IS is set to a large value)
    • dynamic sampling and length penalty techniques
    • no KL penalty term
    • on Qwen2.5-32B-base, CISPO significantly outperforms both DAPO and GRPO with the same number of training steps
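
A minimal sketch of a CISPO-style loss based on the description above; the function name, tensor shapes, and default hyperparameters are my own assumptions, not MiniMax's implementation:

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_high=0.2, eps_low=1e6):
    """CISPO-style objective (sketch): clip the IS weight, not the token update.

    logp_new: (B, T) token log-probs under the current policy (requires grad)
    logp_old: (B, T) token log-probs under the rollout policy
    advantages: (B,) group-relative advantage per sequence
    """
    # token-level importance-sampling weight r = pi_new / pi_old
    ratio = torch.exp(logp_new - logp_old.detach())
    # clip the IS weight itself; with eps_low large, the lower bound is effectively off
    weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    # stop-gradient on the weight: gradients flow through logp_new for every token,
    # so no token's update is zeroed out (unlike PPO/GRPO ratio clipping)
    per_token = weight * advantages.unsqueeze(-1) * logp_new
    # maximize the weighted log-likelihood -> minimize its negative (mean over tokens)
    return -per_token.mean()
```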

5. Magistral (06/12/2025)

  • pdf, https://huggingface.co/mistralai/Magistral-Small-2506
  • Magistral Small (24B, release) and Magistral Medium, based on the Mistral Small 3 and Mistral Medium 3 models
  • adapting GRPO:

    • eliminate KL divergence
    • normalize the loss over the total length of generations in the group: sum the token-wise losses, then divide by the total number of tokens, to avoid length biases
    • normalize the advantages within each minibatch (see the sketch at the end of this section)
    • relax the trust region's upper bound
    • filter out all groups with zero advantage
  • reward shaping:

    • formatting
    • correctness
    • length: soft length penalty to signal the model that the hard cutoff on maximal completion length is near
    • language consistency: If the classifier indicates that all three parts used the same language, we give an additional reward of 0.1
  • infrastructure:

    • 1) Generators continuously output completions to prompts from input data sources.
    • 2) Whenever a completion is finished, it is sent to the appropriate verifier.
    • 3) Each sequence is sent to a different data parallel group using a pre-set permutation until every data parallel group has enough sequences to form a batch.
    • 4) A single gradient step is performed and the trainer and generators are updated.
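
A sketch of the advantage handling described in the "adapting GRPO" list above (group-relative advantages, zero-advantage-group filtering, minibatch normalization); the function name and shapes are my assumptions, not Magistral's actual code:

```python
import torch

def group_advantages(rewards, group_ids, eps=1e-8):
    """Group-relative advantages with zero-advantage-group filtering and
    minibatch normalization (sketch of the adaptations listed above).

    rewards: (N,) scalar reward per completion
    group_ids: (N,) long tensor mapping each completion to its prompt group
    """
    adv = torch.zeros_like(rewards)
    keep = torch.zeros_like(rewards, dtype=torch.bool)
    for g in group_ids.unique():
        m = group_ids == g
        centered = rewards[m] - rewards[m].mean()      # group-relative advantage
        if torch.all(centered == 0):                   # all rewards equal in the group
            continue                                   # -> filter out the whole group
        adv[m] = centered
        keep[m] = True
    # normalize advantages across the kept part of the minibatch
    adv[keep] = (adv[keep] - adv[keep].mean()) / (adv[keep].std() + eps)
    return adv, keep
```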

6. dots.llm1 (06/06/2025)

  • pdf, huggingface
  • MoE that activates 14 billion parameters out of 142 billion parameters
  • pretrained on 11.2T high-quality tokens
  • adopt a sparse DeepSeekMoE framework
  • classic MHA combined with QK-Norm
  • auxiliary-loss-free strategy: a bias term for each expert is added to the corresponding affinity score to determine the top-k routing (sketched below)
  • post-training: SFT using 400K instances
  • 1F1B based all-to-all communication and computation overlap solution
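
A toy sketch of this bias-based, auxiliary-loss-free load balancing as I understand it from the description above; the update rule, names, and the γ step size are assumptions:

```python
import torch

def aux_loss_free_topk(affinity, bias, k=6, gamma=1e-3):
    """Bias-based expert selection (sketch of the auxiliary-loss-free idea above).

    affinity: (n_tokens, n_experts) routing scores
    bias: (n_experts,) per-expert bias, updated outside the gradient path
    """
    # the bias only influences which experts are *selected*; gating weights
    # are still computed from the raw affinity scores
    topk_idx = torch.topk(affinity + bias, k, dim=-1).indices
    gates = torch.gather(affinity, -1, topk_idx)
    gates = gates / gates.sum(-1, keepdim=True)

    # after each step, nudge the bias against the observed load imbalance
    load = torch.zeros(affinity.size(-1))
    load.scatter_add_(0, topk_idx.reshape(-1), torch.ones(topk_idx.numel()))
    bias = bias - gamma * torch.sign(load - load.mean())
    return topk_idx, gates, bias
```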

7. Qwen3 Embedding (06/05/2025)

8. OpenThoughts (06/05/2025)

  • pdf, huggingface
  • We use OpenMath-2-Math as our sole math question source, CodeGolf and OpenCodeReasoning as our code question sources, and StackExchangePhysics and OrganicChemistryPDFs as our science question sources.
  • We use difficulty-based filtering with GPT-4o-mini for code questions, and response length filtering with GPT-4.1-mini for math and science questions.
  • Our final pipeline uses 16× answers per question for all domains. It uses exact deduplication for math and science and no deduplication for code.
  • We do not perform answer filtering because no filtering strategy outperformed the baseline, which uses all the answers.
  • Across all domains, using QwQ-32B as the teacher model outperforms all other teacher models, yielding average accuracy improvements of 1.9% (code) and 2.6% (math) over using DeepSeek-R1 as the teacher
  • OpenThinker3-7B is the best open-data reasoning model at the 7B scale, regardless of optimization algorithm choice (SFT, RL, or both)

9. MiMo-VL (06/04/2025)

  • pdf, huggingface
  • A four-stage pre-training phase

    • 1) projector warmup: freeze the ViT and LLM, Image-Caption Pairs
    • 2) vision-language alignment: ViT is then unfrozen, + Interleaved Data
    • 3) general multimodal pre-training: all parameters are trainable
    • 4) long-context SFT
  • three components: (1) a ViT encoder, (2) an MLP projector, (3) the MiMo-7B-Base language model
  • RLVR

    • verifiable STEM questions from open-source communities and proprietary K-12 collections
    • bounding box predictions
  • GRPO: single-step policy updates following response rollout, eliminating the need for a clipped surrogate training objective

10. Qwen 3 (05/14/2025)

  • pdf, huggingface
  • 6 dense models Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, and Qwen3-32B
  • 2 MoE models, Qwen3-30B-A3B and Qwen3-235B-A22B
  • Qwen3-235B-A22B, has a total of 235B parameters with 22B activated ones
  • we remove QKV-bias used in Qwen2 and introduce QK-Norm to ensure stable training for Qwen3
  • The Qwen3 MoE models have 128 total experts with 8 activated experts per token. Qwen3-MoE design excludes shared experts
  • pretrain a total of 36 trillion tokens
  • use Qwen2.5-VL to extract text from PDF-like documents, amounting to trillions of tokens in total
  • employ Qwen2.5, Qwen2.5-Math, and Qwen2.5-Coder models to synthesize trillions of text tokens, including textbooks, question-answering, instructions, and code snippets, covering dozens of domains
  • Pre-training stages: (S1) General Stage: 30 trillion tokens; (S2) Reasoning Stage: 5T higher-quality tokens on STEM, coding, reasoning, and synthetic data; (S3) Long Context Stage: increase the RoPE base frequency from 10,000 to 1,000,000 (see the sketch below)
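
A small illustrative sketch of how the RoPE base frequency enters the position encoding; the 10k → 1M change above (and the similar changes in several reports below) simply raises `base`. Function names and shapes are mine:

```python
import torch

def rope_frequencies(head_dim, base=1_000_000.0):
    """Inverse frequencies for RoPE; raising `base` (e.g. 10k -> 1M) slows the
    rotation of high-order dimensions so long contexts stay distinguishable."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def apply_rope(x, base=1_000_000.0):
    """x: (seq_len, n_heads, head_dim); returns x with rotary position encoding."""
    seq_len, _, head_dim = x.shape
    inv_freq = rope_frequencies(head_dim, base)              # (head_dim/2,)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```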

11. Llama-Nemotron (05/14/2025)

  • pdf, huggingface
  • five stage training:

    • optimizing inference efficiency with neural architecture search (NAS), starting from Llama 3 models
    • knowledge distillation and continued pretraining.
    • SFT on a mix of standard instruction data and reasoning traces from DeepSeek-R1
    • RL on complex mathematics and STEM datasets
    • alignment phase focused on instruction following and human preference.
  • neural architecture search: Attention removal and Variable FFN dimensions
  • both reasoning and non-reasoning data for supervised fine-tuning
  • For reasoning samples, include the system instruction "detailed thinking on", and for non-reasoning samples, we use "detailed thinking off"
  • GRPO: use a rollout prompt size of 72 and sample 16 responses per prompt with temperature = 1 and top_p = 1. During training, we set global batch size as 576 and conduct 2 gradient updates per rollout.
  • Accuracy rewards: serve the Llama-3.3-70B-Instruct to judge whether the policy’s predictions match the ground truth answer
  • "" and "</think>" tags when using "detailed thinking on" mode and check for the non-existence of thinking tags when using "detailed thinking off" mode.

12. MiMo (05/12/2025)

  • pdf, huggingface
  • pre-trained on 25 trillion tokens
  • Multi-Token Prediction (MTP) objective (a toy sketch follows this list)
  • GQA, pre-RMSNorm, SwiGLU activation and RoPE, similar to Llama and Qwen
  • our final SFT dataset comprises about 500K samples
  • learning rate of 3×10⁻⁵ and batch size of 128; samples are packed to the maximum length of 32,768 tokens during training
  • two categories of verifiable problems, mathematics and code
  • GRPO: Removal of KL Loss, Dynamic Sampling, Clip-Higher
  • verl
  • an easy-data resampling strategy: maintain a pool of easy data and draw 10% of samples from this pool
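
A toy sketch of a multi-token prediction loss in the spirit of the MTP objective above; real implementations (MiMo, DeepSeek-V3) use dedicated sequential MTP modules, whereas this simplification just adds one extra linear head per future offset:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictionHeads(nn.Module):
    """Toy multi-token prediction: one extra linear head per future offset."""
    def __init__(self, d_model, vocab_size, n_future=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))

    def loss(self, hidden, targets):
        # hidden: (B, T, d_model) trunk states; targets: (B, T) ids of the *next* token
        total = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])              # predict the token k steps ahead
            tgt = targets[:, k - 1:-1]                 # ...whose id sits k-1 positions later
            total = total + F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
        return total / len(self.heads)
```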

13. Seed 1.5-VL (05/11/2025)

  • pdf
  • a 532M-parameter vision encoder and a MoE LLM of 20B active parameters

14. Phi-4-reasoning (04/30/2025)

  • pdf, huggingface

  • 14B parameters, SFT from Phi-4

  • highlight the benefits of careful data curation and SFT for reasoning language models

  • Phi-4 base model was pretrained using large innovative synthetic datasets specifically curated to prioritize reasoning and complex problem-solving

  • Seeds database

    • are used in both SFT for Phi-4-reasoning and RL for Phi-4-reasoning-plus.
    • across STEM disciplines and coding, also incorporating general-purpose question-answer style prompts.
    • include alignment-focused data aimed at enhancing model safety, mitigating potential harms, and promoting responsible AI practices
  • we found o3-mini with medium “reasoning effort” to have a similar effect to DeepSeek-R1 when used as a teacher

  • rewards: length-aware correctness, incompleteness, invalid “thinking” block, repetition penalty

  • We select as our RL checkpoint the model with the best observed AIME 2024 score, which is the model trained for 90 steps, over only ∼ 6k examples (and 8 trajectories of responses per example)

15. Seed 1.5-thinking (04/29/2025)

16. Kimi-Audio (04/25/2025)

17. Trillion 7B (04/21/2025)

  • pdf, huggingface
  • For post-training, we closely follow the Tülu 3 framework consisting of SFT, DPO, and RLVR.
  • 2T tokens, Multi-token Prediction
  • extend the RoPE base from 100,000 to 1,000,000 using ABF

18. Seedream 3.0 (04/16/2025)

19. Kimi-VL (04/15/2025)

20. Tulu 3 (04/14/2025)

  • pdf, huggingface
  • Stage 1: Data Curation

    • Precise Instruction Following
    • Math and Coding
    • Noncompliance and Safety
  • Stage 2: Supervised Finetuning
  • Stage 3: Preference Tuning

    • Direct Preference Optimization
    • We find that length-normalized DPO works best, which uses the objective shown after this list
  • Stage 4: Reinforcement Learning with Verifiable Rewards

    • RLVR objective: max_{π_θ} E_{y∼π_θ(·|x)} [ R_RLVR(x, y) ] = E[ v(x, y) − β·KL(π_θ(y|x) ‖ π_ref(y|x)) ], where v(x, y) = α if correct, 0 otherwise, and α = 10 is a hyperparameter.
    • RLVR Data: GSM8K, MATH, and IFEval
    • 30,000 prompts with ground truth labels
    • Initialize the Value model from a General RM
    • Disable Dropout
    • Train with the SFT Dataset and Shuffle Between Epochs
    • Non End-of-Sequence (EOS) Penalty
    • Advantage Whitening / Normalization
    • Starting from a Weaker Model Can Converge to the Same Verifiable Rewards.
    • OpenRLHF
  • Batch Aggregation: early during Tülu 3 training, we noticed a performance gap between SFT models trained with our OpenInstruct framework and models trained in other settings (e.g., on TPUs); it traced back to how the loss was averaged over padding tokens without taking gradient accumulation or distributed training setups into account
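
The length-normalized DPO objective referenced in Stage 3 above is, as I recall it from the Tülu 3 paper (treat this reconstruction as an assumption; y_c = chosen, y_r = rejected):

```latex
\mathcal{L}_{\text{DPO-norm}} = -\log \sigma\!\left(
  \frac{\beta}{|y_c|}\log\frac{\pi_\theta(y_c \mid x)}{\pi_{\text{ref}}(y_c \mid x)}
  - \frac{\beta}{|y_r|}\log\frac{\pi_\theta(y_r \mid x)}{\pi_{\text{ref}}(y_r \mid x)}
\right)
```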

21. Qwen2.5-Omni (03/26/2025)

22. Gemma 3 (03/25/2025)

  • pdf, huggingface

  • alternate between local sliding-window self-attention and global self-attention, with 5 local layers for every global layer; the first layer in the pattern is a local layer (sketched below)

  • replace the soft-capping of Gemma 2 with QK-norm

  • increase RoPE base frequency from 10k to 1M on global self-attention layers

  • keep the frequency of the local layers at 10k

  • 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B

  • Distillation. We sample 256 logits per token, weighted by teacher probabilities.
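
A tiny sketch of the local/global interleaving described above; the function name and string labels are mine:

```python
def gemma3_layer_pattern(n_layers, local_per_global=5):
    """5 sliding-window (local) layers per global layer, starting with a local layer."""
    group = local_per_global + 1
    return ["global" if i % group == group - 1 else "local" for i in range(n_layers)]

# first 12 layers: 5 local, 1 global, repeated
print(gemma3_layer_pattern(12))
```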

23. Phi-4-Mini (03/07/2025)

24. Qwen2.5-VL (02/19/2025)

25. Janus-Pro (01/29/2025)

26. DeepSeek-R1 (01/20/2025)

  • pdf, huggingface

  • DeepSeek-R1-Zero: use DeepSeek-V3-Base as the base model and employ GRPO as the RL framework to improve reasoning performance

  • DeepSeek-R1:

    • (DeepSeek-V3-Base)->(DeepSeek-V3-SFT1): cold-start SFT with thousands of samples from in-context long-CoT prompting plus readable DeepSeek-R1-Zero outputs
    • (DeepSeek-V3-SFT1)->(DeepSeek-V3-RL): reasoning-oriented RL, as in DeepSeek-R1-Zero
    • (DeepSeek-V3-Base)->(DeepSeek-V3-SFT2): two-epoch fine-tuning of DeepSeek-V3-Base on 600k reasoning-related samples obtained via rejection sampling from the RL checkpoint, plus 200k non-reasoning training samples
    • (DeepSeek-V3-SFT2)->(DeepSeek-R1): after fine-tuning, an additional RL stage that takes prompts from all scenarios into account
  • Do not use an ORM or PRM; use a rule-based reward system: accuracy rewards and format rewards (a toy sketch follows this list).

  • Emphasize that a neural reward model may suffer from reward hacking in the large-scale reinforcement learning process

  • Designing a straightforward template that guides the base model to adhere to specified instructions

  • (Interesting) DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This improvement is not the result of external adjustments but rather an intrinsic development within the model.

  • (Interesting) Behaviors such as reflection are not explicitly programmed but instead emerge as a result of the model's interaction with the reinforcement learning environment.

  • Aha Moment of DeepSeek-R1-Zero: DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach.

    • DeepSeek-R1-Zero struggles with issues such as poor readability and language mixing.
    • Distillation from DeepSeek-R1 to smaller dense models works well. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities
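
A toy sketch in the spirit of such a rule-based reward (accuracy + format); the tag pattern, reward weights, and the check_answer callable are illustrative assumptions, not DeepSeek's actual rules:

```python
import re

def rule_based_reward(response, ground_truth, check_answer):
    """Accuracy + format rewards (illustrative values only).

    check_answer: hypothetical callable comparing an extracted answer to ground truth.
    """
    reward = 0.0
    # format reward: reasoning wrapped in <think>...</think>, answer afterwards
    if re.fullmatch(r"(?s)\s*<think>.*</think>.*", response):
        reward += 0.5
    # accuracy reward: extract the boxed final answer and verify it
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    if m and check_answer(m.group(1), ground_truth):
        reward += 1.0
    return reward
```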

27. Kimi K1.5 (01/20/2025)

  • pdf

  • Long-CoT Supervised Fine-Tuning

    • construct a small yet high-quality long-CoT warmup dataset
  • Reinforcement Learning

    • For verifiable problems, the reward comes from predefined criteria or rules; for problems with free-form ground truth, use a reward model r(x, y, y*)
    • length penalty to mitigate the overthinking phenomenon (a sketch follows this list)
    • several approaches to the long2short problem, including model merging, shortest rejection sampling, DPO, and long2short RL
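
A sketch of a length-penalty reward along these lines; the exact form below (λ = 0.5 − (len − min_len)/(max_len − min_len), applied within a group of sampled responses) is my recollection of the k1.5 scheme and should be treated as an assumption:

```python
def length_penalty(length, min_len, max_len, correct):
    """Length-penalty reward (sketch): shorter correct answers earn more,
    overly long answers are penalized; constants are assumptions."""
    lam = 0.5 - (length - min_len) / max(max_len - min_len, 1)
    return lam if correct else min(0.0, lam)

# example: responses in a group range from 200 to 1200 tokens
print(length_penalty(300, 200, 1200, correct=True))    # 0.4
print(length_penalty(1200, 200, 1200, correct=False))  # -0.5
```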

28. MiniMax-01 (01/15/2025)

  • pdf, huggingface minimax-01

  • 456 billion parameters, 45.9 billion activations, and 32 experts, 1.5T tokens for pre-training

  • good to know that naive linear attention O = Norm(Q(KᵀV)) has efficiency issues due to the cumulative-sum operation required by the causal mask (see the sketch below)

  • Need to learn the detail of Lightning Attention https://sustcsonglin.github.io/assets/pdf/talk_250117.pdf

  • Transformer-style blocks, each comprising a channel mixer (an attention block: lightning attention or softmax attention) and a feature mixer (an MLP block: an MoE with multiple feed-forward networks (FFNs))

  • hybrid architectures have yielded promising results; they delve deeper into this potential through two variants: hybrid-cosformer2 and hybrid-hgrn2

  • Almost perfect long-context understanding ability, with a context window of 1M tokens
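
A minimal sketch of why causal linear attention needs a running (cumulative) KV state, which is the efficiency issue noted above; the shapes and identity feature map are simplifying assumptions:

```python
import torch

def causal_linear_attention(q, k, v):
    """Causal linear attention (sketch): without the causal mask the output is just
    Q @ (K^T V), but causality forces a running cumulative KV state per position."""
    T, d = q.shape                      # q, k, v: (T, d); identity feature map for brevity
    kv_state = torch.zeros(d, d)
    out = torch.zeros_like(v)
    for t in range(T):
        kv_state = kv_state + torch.outer(k[t], v[t])   # cumulative sum over time
        out[t] = q[t] @ kv_state
    return out
```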

29. Qwen2.5-Math-PRM (01/13/2025)

  • pdf, huggingface

  • Commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods.

  • Reveal the potential bias in using response-level BoN evaluation alone for PRMs

  • TBD

30. OLMo 2 (12/31/2024)

  • pdf, huggingface olmo-2

  • up to 5T tokens, 95% derived from web data; 7B and 13B parameter models

  • Reordered norm and QK-norm: h = x + RMSNorm(Attention(x)); out = h + RMSNorm(MLP(h)) (sketched below)

  • Data can be a cause of both gradient norm and loss spikes. When investigating training batches at which spikes occurred, we found a high prevalence of instances containing long, repeated n-gram sequences

  • to improve training stability, OLMo 2 initializes every parameter from a normal distribution with a mean of 0 and a standard deviation of 0.02

  • decreasing the AdamW ε from 10⁻⁵ to 10⁻⁸

  • confirm the effectiveness of this approach, also known as model souping, on six different mid-training mixes

  • three phases of training: SFT, preference tuning with DPO, and RLVR

  • turn off weight decay for embeddings and observe that embedding norms settle in a healthy region.
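
A sketch of the reordered-norm block as written above; module names are mine, QK-norm is omitted, and nn.RMSNorm assumes a recent PyTorch version:

```python
import torch.nn as nn

class ReorderedNormBlock(nn.Module):
    """Normalize the *outputs* of attention and MLP before the residual add
    (sketch of the reordered-norm formulation above)."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.attn_norm = nn.RMSNorm(d_model)
        self.mlp_norm = nn.RMSNorm(d_model)

    def forward(self, x):
        h = x + self.attn_norm(self.attn(x))
        return h + self.mlp_norm(self.mlp(h))
```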

31. DeepSeek-V3 (12/16/2024)

  • pdf, huggingface

  • multi-token prediction objective; the acceptance rate of 2nd-token prediction is 85%–90%

  • knowledge distillation from DeepSeek-R1, notably improves its reasoning performance

  • balanced expert loading: introduce a bias term for each expert to help determine the top-K routing

  • DualPipe: overlap the computation and communication within forward and backward chunks.

  • FP8 quantization during training: introduce a fine-grained (tile- and block-wise) quantization strategy for FP8 (a toy sketch follows this list)

  • an efficient and lightweight training framework, HAI-LLM. (might be the impressive engineering basis)

  • numbers: 14.8T tokens for pre-training

  • RMSNorm recomputation during back-propagation

  • adopt BF16 for the first and second moments in AdamW

  • do not incorporate cross-sample attention masking during training

  • use document packing method for data integrity

  • incorporate the FIM strategy in the pre-training

  • shared embedding and output head for multi-token prediction (due to the DualPipe implementation)

  • not use costly tensor parallelism

  • suggestions on hardware design

    • higher FP8 GEMM accumulation precision
    • tile- and block-wise quantization

      • online quantization
      • transposed GEMM operations
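
A toy sketch of block-wise FP8 quantization in the spirit of the above (per-tile scales rather than one per-tensor scale); the block size, e4m3 max value, and function name are assumptions, and torch.float8_e4m3fn requires a recent PyTorch:

```python
import torch

def blockwise_quantize_fp8(x, block=128):
    """Per-tile FP8 quantization (sketch): one scale per (block x block) tile.
    Assumes x is 2D with dimensions divisible by `block`."""
    fp8_max = 448.0                                  # max magnitude of float8_e4m3fn
    out = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    scales = torch.empty(x.shape[0] // block, x.shape[1] // block)
    for i in range(0, x.shape[0], block):
        for j in range(0, x.shape[1], block):
            tile = x[i:i + block, j:j + block]
            s = tile.abs().max().clamp(min=1e-12) / fp8_max
            scales[i // block, j // block] = s
            out[i:i + block, j:j + block] = (tile / s).to(torch.float8_e4m3fn)
    return out, scales                               # dequantize: out.float() * per-tile scale
```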

32. Qwen2.5 (12/19/2024)

  • pdf, huggingface

  • 0.5B, 1.5B, 3B, 7B, 14B, and 72B models; 18T tokens for pre-training

  • Qwen2-72B-Instruct and Qwen2-Math-72B-Instruct generate synthetic data in mathematics, code, and knowledge domains

  • increase the RoPE base from 10,000 to 1,000,000 using the ABF technique

  • develop long-response datasets capable of generating high-quality responses of up to 8,192 tokens

33. Phi-4 (12/12/2024)

  • pdf

  • numbers: 14B, 10T tokens

  • 50 broad types of synthetic datasets totaling about 400B tokens, spanning an array of topics, skills, and natures of interaction

  • question-answer data contributed significantly to various capabilities, such as mathematical reasoning and academic performance

  • one round of SFT, one round of DPO on data from our pivotal token search method, and one round of DPO on full length preference pairs

  • 8B tokens of data for SFT, all formatted in the chatml format

34. TÜLU 3 (12/06/2024)

  • pdf, huggingface

  • synthetic data generation for target skills such as precise instruction following, math and coding

  • safety SFT data was generally orthogonal to our other datasets

  • changing the chat template: replacing the newlines at the end of assistant messages with an EOS token

  • SFT performance noticeably varies based on the seed

  • model soup does not always outperform the best single run

  • use length-normalized DPO for tuning our preference data mixtures and generation methods

  • scaling the number of unique prompts improve downstream DPO performance

  • for our final DPO models we decided on using a learning rate of 2.0×10⁻⁷

  • introduce RLVR (Reinforcement Learning with Verifiable Rewards), a novel method for training LLMs on tasks with verifiable outcomes

  • RLVR focus on two domains (mathematics, exact instruction following) and three evaluations (GSM8K, MATH, IFEval)

35. Llama 3 (08/15/2024)

  • pdf, huggingface

  • 405B parameters on 15.6T tokens using a context window of 8K tokens.

  • extended the supported context window to 128K tokens

  • supervised finetuning on instruction tuning data and Direct Preference Optimization

  • annealing on small amounts of high-quality code and mathematical data can boost the performance of pre-trained models on key benchmarks

  • Llama 3 405B is trained on up to 16K H100 GPUs

  • use fully sharded data parallelism (FSDP) for training

  • design a new multi-message chat protocol which uses various special header and termination tokens.

  • average models obtained from experiments using various versions of data or hyperparameters at each RM, SFT, or DPO stage

36. OLMo (07/07/2024)

  • pdf, huggingface olmo, huggingface olmo-2

  • 1B and 7B models, 2T tokens Dolma dataset

  • use up to 256 nodes on this cluster, where each node consists of 4x AMD MI250X GPUs with 128GB of memory and 800Gbps of interconnect

  • release model weights, training data and training and evaluation code.