- We introduce a novel approach to enhance model reasoning by using long/short CoT as preferred/rejected examples in DPO.
- Our method demonstrates that carefully curated long/short thinking data pairs can significantly boost mathematical reasoning capabilities.
Overview
Developing models with robust reasoning capabilities remains a key challenge in AI. While supervised fine-tuning (SFT) on extensive slow-thinking datasets is effective, we discovered that a targeted approach using a smaller dataset (5,000 examples) contrasting fast and slow thinking can yield substantial improvements.
This blog presents our methodology, including dataset curation and training details, along with empirical results demonstrating the effectiveness of our approach.
Method
Data Curation
Our approach utilizes two distinct dataset types:
- SFT Datasets:
  - OpenO1-SFT: Based on the O1-OPEN/OpenO1-SFT dataset, we remove special tokens (e.g., <Thought>) and reformat final answers into the $\boxed{}$ format using GPT-4o-mini. We also filter out non-English entries and sequences longer than 8,192 tokens. The final dataset contains approximately 80,000 examples.
  - NuminaMath-CoT: Examples are randomly selected from AI-MO/NuminaMath-CoT to match the size of the OpenO1-SFT dataset; this set is used for comparison.
  - Sky-T1_data_17k: Based on NovaSky-AI/Sky-T1_data_17k, we remove special characters and reformat final answers into QwQ's output style.
- DPO Data:
  - Rejected answers: a random selection of 5,000 examples from AI-MO/NuminaMath-CoT, whose original short-CoT solutions are treated as the "rejected" entries.
  - Chosen answers: a new long-CoT answer for each problem is generated with the Qwen/QwQ-32B-Preview model.
  - Filtering: only pairs in which both the "chosen" and "rejected" answers are correct are kept, ensuring high-quality data. A minimal construction sketch follows this list.
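The sketch below illustrates one way the pair construction and filtering could be implemented. It is a minimal illustration rather than the exact pipeline: the `problem`/`solution` column names, the `qwq_answers` mapping (problem text to a pre-generated QwQ-32B-Preview solution), and the `extract_boxed_answer` helper are assumptions, and correctness is approximated by matching each solution's final $\boxed{}$ answer against the NuminaMath reference.

```python
# Minimal sketch of building chosen/rejected pairs for DPO (not the exact pipeline).
# Assumptions: QwQ-32B-Preview generations are already available in `qwq_answers`
# (problem text -> long-CoT solution), and NuminaMath-CoT exposes "problem"/"solution"
# columns. Correctness is approximated by comparing final \boxed{} answers.
import re
from datasets import load_dataset


def extract_boxed_answer(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in `text`, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def build_dpo_pairs(qwq_answers: dict[str, str], n_pairs: int = 5000) -> list[dict]:
    """Pair long-CoT (chosen) against short-CoT (rejected) solutions,
    keeping only pairs whose final answers agree with the reference."""
    numina = load_dataset("AI-MO/NuminaMath-CoT", split="train").shuffle(seed=42)
    pairs = []
    for row in numina:
        problem, short_solution = row["problem"], row["solution"]
        long_solution = qwq_answers.get(problem)
        if long_solution is None:
            continue
        reference = extract_boxed_answer(short_solution)
        if reference is not None and extract_boxed_answer(long_solution) == reference:
            pairs.append(
                {"prompt": problem, "chosen": long_solution, "rejected": short_solution}
            )
        if len(pairs) >= n_pairs:
            break
    return pairs
```

The resulting records use the prompt/chosen/rejected schema expected by common DPO implementations.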
Training
- SFT Training:
  - Each SFT dataset is used to train a LLaMA-3.1-8B model for one epoch with a learning rate of 3e-5 and a batch size of 32.
- DPO Training:
  - Using the 5,000 curated examples, we perform DPO training on the SFT-trained models.
  - As a baseline, we also conduct DPO training directly on the LLaMA-3.1-8B-Instruct model.
  - All DPO training uses a beta of 0.01 and a batch size of 32 (a minimal training sketch follows this list).
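For concreteness, here is a minimal sketch of the DPO stage using Hugging Face TRL. The framework choice, the checkpoint path, the per-device batch layout, and the learning rate are assumptions (the blog does not specify them); only beta = 0.01, the batch size of 32, and the pair format come from the text above.

```python
# Minimal DPO training sketch (assumed framework: Hugging Face TRL; the blog does not
# name the actual training stack). beta=0.01 and the batch size of 32 come from the
# section above; paths, learning rate, and dataset handling are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_path = "path/to/sft-checkpoint"  # placeholder: the SFT-trained LLaMA-3.1-8B
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# The 5,000 {"prompt", "chosen", "rejected"} pairs built earlier (toy record shown here).
train_dataset = Dataset.from_list(
    [{"prompt": "What is 2 + 2?",
      "chosen": "Let us reason step by step ... so the answer is \\boxed{4}.",
      "rejected": "2 + 2 = \\boxed{4}"}]
)

config = DPOConfig(
    output_dir="llama31-8b-dpo",
    beta=0.01,                      # DPO beta from the section above
    per_device_train_batch_size=4,  # 4 x 8 GPUs = effective batch size 32 (assumed layout)
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=5e-7,             # assumed; the blog does not report the DPO learning rate
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```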
Results
- We present results for models first trained with SFT on each dataset and then further trained with DPO.
Performance on MATH500 and MATH Overall
| Dataset | Training | MATH500 | MATH Overall | Improvement (MATH500 / MATH Overall) |
|---|---|---|---|---|
| NuminaMath-CoT | SFT | 0.162 | 0.25 | - |
| NuminaMath-CoT | DPO | 0.176 | 0.26 | +8.64% / +4.00% |
| OpenO1-SFT | SFT | 0.230 | 0.33 | - |
| OpenO1-SFT | DPO | 0.284 | 0.40 | +23.48% / +21.21% |
| Sky-T1_data_17k | SFT | 0.236 | 0.33 | - |
| Sky-T1_data_17k | DPO | 0.252 | 0.36 | +6.78% / +9.09% |
Key Observations:
- DPO training consistently improves performance across all datasets
- OpenO1-SFT shows the most significant gains (+23.48% on MATH500)
- Improvements hold across most difficulty levels, though a few individual level scores regress slightly after DPO (see the table below)
MATH Performance by Difficulty Level
To further examine mathematical reasoning across difficulty levels, we sample 100 problems from each level of the MATH dataset and evaluate each model on these level-specific subsets; a sampling sketch is shown after the table below.
| Dataset | Training | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|
| NuminaMath-CoT | SFT | 0.45 | 0.31 | 0.19 | 0.09 | 0.05 |
| NuminaMath-CoT | DPO | 0.51 | 0.27 | 0.24 | 0.11 | 0.01 |
| OpenO1-SFT | SFT | 0.58 | 0.40 | 0.31 | 0.20 | 0.10 |
| OpenO1-SFT | DPO | 0.73 | 0.45 | 0.40 | 0.28 | 0.09 |
| Sky-T1_data_17k | SFT | 0.61 | 0.36 | 0.33 | 0.22 | 0.08 |
| Sky-T1_data_17k | DPO | 0.68 | 0.40 | 0.26 | 0.22 | 0.14 |
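Below is a minimal sketch of the per-level subset construction described above, assuming the MATH problems are available as records with a `level` field (e.g., "Level 1" through "Level 5"); the exact data source and evaluation harness are not specified in the blog.

```python
# Minimal sketch of the per-level evaluation-set construction described above.
# Assumption: MATH test problems are available as dicts with "problem", "solution",
# and "level" fields; the actual data loading and scoring code are not specified.
import random
from collections import defaultdict


def sample_per_level(math_problems: list[dict], per_level: int = 100,
                     seed: int = 42) -> dict[str, list[dict]]:
    """Group MATH problems by difficulty level and sample `per_level` from each."""
    rng = random.Random(seed)
    by_level: dict[str, list[dict]] = defaultdict(list)
    for item in math_problems:
        by_level[item["level"]].append(item)
    return {
        level: rng.sample(items, min(per_level, len(items)))
        for level, items in sorted(by_level.items())
    }
```

Each model is then scored on the five resulting 100-problem subsets, giving the per-level accuracies reported in the table above.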
Notes
- The Kimi k1.5 model explored a similar approach but used short CoT as the preferred examples, targeting inference efficiency rather than reasoning quality.