Reproducing the inference-time scaling experiments

In this blog post, I share my reproduction of Hugging Face's blogpost-scaling-test-time-compute. The goal is to show that, with more generated tokens, a smaller model's performance can approach that of a larger model.

1. Takeaways

2. Dataset and model

2.1. Dataset

The dataset used in this experiment is HuggingFaceH4/MATH-500. It consists of 500 problems from the MATH benchmark, each containing:

problem: Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$

solution: We have that $r = \sqrt{0^2 + 3^2} = 3.$ Also, if we draw the line connecting the origin and $(0,3),$ this line makes an angle of $\frac{\pi}{2}$ with the positive $x$-axis. [asy] unitsize(0.8 cm); draw((-0.5,0)--(3.5,0)); draw((0,-0.5)--(0,3.5)); draw(arc((0,0),3,0,90),red,Arrow(6)); dot((0,3), red); label("$(0,3)$", (0,3), W); dot((3,0), red); [/asy] Therefore, the polar coordinates are $\boxed{\left( 3, \frac{\pi}{2} \right)}.$

answer: \left( 3, \frac{\pi}{2} \right)
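
For reference, a minimal sketch for loading the dataset with the datasets library (assuming the dataset's single test split):

from datasets import load_dataset

# MATH-500: 500 problems in a single "test" split.
math500 = load_dataset("HuggingFaceH4/MATH-500", split="test")

example = math500[0]
print(example["problem"])   # problem statement
print(example["solution"])  # worked reference solution
print(example["answer"])    # ground-truth final answer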

2.2. Large language models

I evaluate two model families, Llama and Qwen, at different sizes:

2.3. Reward model

Llama3.1-8B-PRM-Deepseek-Data

The model is trained from meta-llama/Llama-3.1-8B-Instruct on RLHFlow/Deepseek-PRM-Data for 1 epoch. It can be used as either an ORM or a PRM: an ORM evaluates the final solution as a whole, while a PRM judges the logical correctness of each intermediate step. The first example below shows the ORM-style input, with the full solution in a single user turn; the second shows the PRM-style input, where each step is a separate user turn and the assistant replies "+" (correct) or "-" (incorrect).

[
{"role": "user", "content": "Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. To convert from rectangular coordinates $(x, y)$ to polar coordinates $(r, \\theta)$, we can use the formulas\n\\[r = \\sqrt{x^2 + y^2}\\]\n\\[\\theta = \\arctan \\frac{y}{x}\\]\n\nIn this case, the rectangular coordinates are $(0,3)$, so $x = 0$ and $y = 3$. \n\nFirst, we calculate $r$:\n\\[r = \\sqrt{0^2 + 3^2} = \\sqrt{9} = 3\\]\n\nNext, we calculate $\\theta$:\n\\[\\theta = \\arctan \\frac{3}{0}\\]\nSince the tangent function is not defined for $x = 0$, we need to use a special case. When $x = 0$, $\\theta = \\frac{\\pi}{2}$ if $y > 0$, and $\\theta = \\frac{3\\pi}{2}$ if $y < 0$. In this case, $y = 3 > 0$, so $\\theta = \\frac{\\pi}{2}$.\n\nSo, the polar coordinates equivalent to $(0,3)$ are $\\boxed{(3,\\frac{\\pi}{2})}$."},
{"role": "assistant", "content": "+"},
]
[
{"role": "user", "content": "Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. To convert from rectangular coordinates $(x, y)$ to polar coordinates $(r, \\theta)$, we can use the formulas\n\\[r = \\sqrt{x^2 + y^2}\\]\n\\[\\theta = \\arctan \\frac{y}{x}\\]"},
{"role": "assistant", "content": "+"},
{"role": "user", "content": "In this case, the rectangular coordinates are $(0,3)$, so $x = 0$ and $y = 3$."},
{"role": "assistant", "content": "+"},
{"role": "user", "content": "In this case, $y = 3 > 0$, so $\\theta = \\frac{\\pi}{2}$."},
{"role": "assistant", "content": "+"},
{"role": "user", "content": "So, the polar coordinates equivalent to $(0,3)$ are $\\boxed{(3,\\frac{\\pi}{2})}$."},
{"role": "assistant", "content": "+"},
]
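
To score a conversation like the ones above, one forward pass suffices: read the probability of "+" (vs. "-") at each assistant turn. Below is a rough sketch following the RLHFlow usage convention; the token-indexing details are my assumption, not verbatim code from the blog:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

plus_id = tokenizer.encode("+", add_special_tokens=False)[-1]
minus_id = tokenizer.encode("-", add_special_tokens=False)[-1]

def score_step(messages):
    """P(step correct): probability of '+' vs '-' at the final assistant turn.

    `messages` is a list of {role, content} dicts ending with the assistant
    '+' placeholder, exactly like the examples above.
    """
    input_ids = tokenizer.apply_chat_template(
        messages, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        logits = model(input_ids).logits[0]
    # Logits at position t predict token t+1, so read the distribution one
    # position before the last '+' placeholder.
    pos = (input_ids[0] == plus_id).nonzero()[-1].item()
    probs = torch.softmax(logits[pos - 1, [plus_id, minus_id]], dim=-1)
    return probs[0].item()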

2.4. Test-time scaling strategies
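
As the later sections use them, the strategies all boil down to sampling $N$ candidate solutions and aggregating: majority voting picks the most common final answer, while best-of-$N$ picks the candidate the reward model scores highest. A minimal sketch of the two aggregators (function names and signatures are illustrative):

from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer among the N sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(solutions, scores):
    """Pick the solution whose reward-model score is highest."""
    return max(zip(solutions, scores), key=lambda pair: pair[1])[0]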

3. Reproduced results

3.1. Observations

4. Performance improvement in terms of FLOPs?

A natural question: does scaling test-time compute yield consistent improvements if we measure the actual FLOPs cost rather than just the number of generated tokens?

Different model sizes have different computational demands. Additionally, for inference, the FLOPs for prefill (the forward pass over the prompt) and decoding (token-by-token generation) are quite different. The PRM approach adds the extra overhead of the reward model's forward passes. Moreover, inference FLOPs may not be linear in model size, so I want to see whether the performance improvement measured in FLOPs is consistent with that measured in generated tokens.

Roughly, the total cost scales linearly with the number of candidates,

$\text{FLOPs}(N) \approx N \cdot \text{FLOPs}_{\text{sample}},$

where $N$ is the number of samples generated.

4.1. LLM FLOPs estimation

I estimated the FLOPs of the forward pass for the prefill and decoding stages as follows. The equations and the analysis are based on this paper on arXiv.

In the following analysis, I use this notation: $b$ is the batch size, $s$ the sequence length, $h$ the hidden dimension, $h'$ the intermediate (FFN) dimension, and $n$ the number of attention heads.

For the prefill stage, the per-layer operations and their corresponding FLOPs are:

๐‘ธ๐‘ฒ๐‘ฝ=๐‘ฟ๐‘พ๐‘„๐พ๐‘‰ 6๐‘๐‘ โ„Ž2
๐‘ธ๐‘ฒ=ย RoPE(๐‘ธ๐‘ฒ) 6๐‘๐‘ โ„Ž
๐‘ถ=ย Attn(๐‘ธ๐‘ฒ๐‘ฝ) 4๐‘๐‘ 2โ„Ž+4๐‘๐‘ 2๐‘›
๐‘ฟ=๐‘ถ๐‘พ๐‘‚ 2๐‘๐‘ โ„Ž2
๐‘ฟ=ย Add&Norm(๐‘ฟ) 5๐‘๐‘ โ„Ž
๐‘ฎ๐‘ผ=๐‘ฟ[๐‘พ๐บ,๐‘พ๐‘ˆ] 4๐‘๐‘ โ„Žโ„Žโ€ฒ
๐‘ซ=ย Swish(๐‘ฎ)๐‘ผ 2๐‘๐‘ โ„Žโ€ฒ
๐‘ฟ=๐‘ซ๐‘พ๐ท 2๐‘๐‘ โ„Žโ„Žโ€ฒ
๐‘ฟ=ย Add&Norm(๐‘ฟ) 5๐‘๐‘ โ„Ž

For the decoding stage, the per-token operations and corresponding FLOPs are:

(๐‘ž,๐‘˜,๐‘ฃ)=๐‘ฅ๐‘พ๐‘„๐พ๐‘‰ 6๐‘โ„Ž2
(๐‘ž,๐‘˜)=ย RoPE(๐‘ž,๐‘˜) 6๐‘โ„Ž
(๐พ,๐‘‰)=ย Cache(๐‘˜,๐‘ฃ) โ€œ-โ€
๐‘œ=ย Attn(๐‘ž,๐พ,๐‘‰) 4๐‘๐‘ โ„Ž+4๐‘๐‘ ๐‘›
๐‘ฅ=๐‘œ๐‘พ๐‘‚ 2๐‘โ„Ž2
๐‘ฅ=ย Add&Norm(๐‘ฅ) 5๐‘โ„Ž
(๐‘”,๐‘ข)=๐‘ฅ[๐‘พ๐บ,๐‘พ๐‘ˆ] 4๐‘โ„Žโ„Žโ€ฒ
๐‘‘=ย Swish(๐‘”)๐‘ข 2๐‘โ„Žโ€ฒ
๐‘ฅ=๐‘‘๐‘พ๐ท 2๐‘โ„Žโ„Žโ€ฒ
๐‘ฅ=ย Add&Norm(๐‘ฅ) 5๐‘โ„Ž

For the MATH-500 dataset, I estimate the FLOPs of the forward pass with batch size $b = 1$. Summing the per-operation costs above, and letting the KV cache grow by one token per decoding step, I use the following formulas to compute the total FLOPs:

$\text{FLOPs}_{\text{prefill}}(s) = 8sh^2 + 16sh + 4s^2h + 4s^2n + 6shh' + 2sh'$

$\text{FLOPs}_{\text{decode}}(s) = 8h^2 + 16h + 4sh + 4sn + 6hh' + 2h'$

$\text{FLOPs}_{\text{total}} = \text{FLOPs}_{\text{prefill}}(p_l) + \sum_{i=0}^{d_l - 1} \text{FLOPs}_{\text{decode}}(p_l + i)$

where $p_l$ is the length of the problem prompt, and $d_l$ is the number of tokens generated for the solution.
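
To make the bookkeeping concrete, here is a small sketch of the estimator in Python. The per-layer costs above are multiplied by the number of transformer layers (that handling is my assumption); $h$, $h'$, and $n$ would be read from each model's config:

def flops_prefill(s, h, h_ffn, n):
    """Per-layer forward FLOPs for a prompt of s tokens (b = 1)."""
    return 8*s*h**2 + 16*s*h + 4*s**2*h + 4*s**2*n + 6*s*h*h_ffn + 2*s*h_ffn

def flops_decode(s, h, h_ffn, n):
    """Per-layer FLOPs for decoding one token with a KV cache of length s."""
    return 8*h**2 + 16*h + 4*s*h + 4*s*n + 6*h*h_ffn + 2*h_ffn

def flops_total(p_len, d_len, h, h_ffn, n, n_layers):
    """Prefill the prompt once, then decode d_len tokens one by one."""
    total = flops_prefill(p_len, h, h_ffn, n)
    total += sum(flops_decode(p_len + i, h, h_ffn, n) for i in range(d_len))
    return n_layers * total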

4.2. Results

Below, we re-plot the same data (accuracy vs. total FLOPs) for Qwen2.5 models of various sizes. The left endpoint of each curve (for majority voting) corresponds to the minimal compute cost of greedy decoding ($N = 1$). Moving right along the compute axis, (ideally) smaller models with fewer FLOPs can achieve performance similar to that of larger models with more FLOPs.

The results are shown below:

4.3. Observations

5. Summary

This reproduction reaffirms the main conclusion from the Hugging Face blog post: scaling test-time compute (by sampling multiple solutions and picking the best or majority) can improve accuracy, especially for smaller models. Yet, these improvements don't entirely overcome the fundamental quality gap between smaller and larger models.

We further demonstrate how analyzing FLOPs clarifies the computational trade-offs in test-time scaling. It's not always free to sample or evaluate more solutions. Practitioners need to weigh the cost-to-benefit ratio carefully, particularly if they aim to deploy these methods at scale.