Cross-entropy loss and its optimization [WIP]

1. Background

Computing the cross-entropy loss becomes significantly more challenging for LLMs, primarily because of the extremely large logit and label matrices involved: for $n$ tokens and a vocabulary of size $V$, the logit matrix alone has $n \times V$ entries, which leads to high computational cost and memory usage. Recently, several optimization strategies have been proposed to address this issue, starting from a PyTorch GitHub issue.

All these approaches share a common goal: avoiding the full materialization of the logit matrix. They achieve this by:

  1. chunking the logit matrix (see the sketch after this list)
  2. computing the gradient of the logits in place
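
As a rough illustration of the first idea, here is a minimal sketch (not taken from any of the libraries discussed later; the helper name `chunked_linear_cross_entropy` and the default chunk size are made up) that computes the loss chunk by chunk, so only a chunk-sized slice of the logit matrix exists at any one time. As written, autograd would still save each chunk's logits for the backward pass, so the real implementations additionally compute the gradient of the logits in place inside a custom autograd function, which is the second idea above.

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, weight, labels, chunk_size=1024):
    """Illustrative sketch: mean cross-entropy over all rows of `hidden`
    without materializing the full [n, vocab] logit matrix at once.
    hidden: [n, d_in], weight: [vocab, d_in], labels: [n]."""
    n = hidden.shape[0]
    total = hidden.new_zeros(())
    for start in range(0, n, chunk_size):
        h = hidden[start:start + chunk_size]              # [chunk, d_in]
        logits = h @ weight.T                             # only a [chunk, vocab] slice
        total = total + F.cross_entropy(
            logits, labels[start:start + chunk_size], reduction="sum")
    return total / n
```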

In this post, I will dive into the cross-entropy loss and its optimization strategies.

2. Softmax Cross-Entropy

2.1. Forward Pass

Let’s begin by understanding the forward pass of the cross-entropy loss.

Consider a single instance with logit vector $\boldsymbol{x} \in \mathbb{R}^{d}$ and target class label $y \in \{1, \dots, d\}$.

The softmax function converts the logits into probabilities:

𝒑𝑖=π‘’π’™π‘–βˆ‘π‘˜=1π‘‘π‘’π’™π‘˜

Here, $\boldsymbol{p}_i$ represents the probability of the input belonging to class $i$.

The cross-entropy loss for a single instance is then defined as:

$$
L = -\log(\boldsymbol{p}_y)
$$

Expanding this, we get:

$$
L = -\log(\boldsymbol{p}_y)
= -\log\!\left(\frac{e^{\boldsymbol{x}_y}}{\sum_{k=1}^{d} e^{\boldsymbol{x}_k}}\right)
= -\log\left(e^{\boldsymbol{x}_y}\right) + \log\!\left(\sum_{k=1}^{d} e^{\boldsymbol{x}_k}\right)
= -\boldsymbol{x}_y + \log\!\left(\sum_{k=1}^{d} e^{\boldsymbol{x}_k}\right)
$$
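
To make the last form concrete, here is a minimal sketch for a single instance that computes $-\boldsymbol{x}_y + \log\sum_k e^{\boldsymbol{x}_k}$ in a numerically stable way (the maximum logit is subtracted before exponentiating); the logits and the target class below are arbitrary example values:

```python
import torch
import torch.nn.functional as F

def cross_entropy_forward(x, y):
    """L = -x[y] + log(sum_k exp(x[k])), computed stably by
    subtracting the max logit before exponentiating."""
    m = x.max()
    lse = m + torch.log(torch.exp(x - m).sum())
    return -x[y] + lse

x = torch.tensor([2.0, 1.0, 0.1])   # arbitrary logits
y = 1                               # arbitrary target class
print(cross_entropy_forward(x, y))
print(F.cross_entropy(x.unsqueeze(0), torch.tensor([y])))  # reference value
```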

2.2. Backward Pass

In general, the gradient of the loss with respect to the logits $\boldsymbol{z}$ (the same vector written as $\boldsymbol{x}$ in the previous section) is given by the chain rule:

βˆ‚πΏβˆ‚π’›π‘–=βˆ‚πΏβˆ‚π’‘π‘—βˆ‚π’‘π‘—βˆ‚π’›π‘–

2.2.1. Step 1: Compute $\partial \boldsymbol{p}_j / \partial \boldsymbol{z}_i$

The result is:

βˆ‚π’‘π‘—βˆ‚π’›π‘–={𝒑𝑗(1βˆ’π’‘π‘—)if  𝑗=π‘–βˆ’π’‘π‘—π’‘π‘–if  𝑗≠𝑖

The full derivation for the case $j = i$ is:

βˆ‚π’‘π‘—βˆ‚π’›π‘—=βˆ‚(π‘’π’›π‘—βˆ‘π‘˜=1π‘π‘’π’›π‘˜)βˆ‚π’›π‘—=(βˆ‘π‘˜=1π‘π‘’π’›π‘˜)β‹…π‘’π’›π‘—βˆ’π‘’π’›π‘—π‘’π’›π‘—(βˆ‘π‘˜=1π‘π‘’π’›π‘˜)2=(π‘’π’›π‘—βˆ‘π‘˜=1π‘π‘’π’›π‘˜)(1βˆ’π‘’π’›π‘—βˆ‘π‘˜=1π‘π‘’π’›π‘˜)=𝒑𝑗(1βˆ’π’‘π‘—)

And for $j \neq i$:

βˆ‚π’‘π‘—βˆ‚π’›π‘–=βˆ‚(π‘’π’›π‘—βˆ‘π‘˜=1π‘π‘’π’›π‘˜)βˆ‚π’›π‘–=βˆ’π‘’π’›π‘—β‹…π‘’π’›π‘–(βˆ‘π‘˜=1π‘π‘’π’›π‘˜)2=βˆ’π’‘π‘—π’‘π‘–

2.2.2. Step 2: Compute $\partial L / \partial \boldsymbol{z}_i$

βˆ‚πΏβˆ‚π’›π‘–=βˆ‘π‘—=1π‘βˆ‚(βˆ’π’•π‘—log𝒑𝑗)βˆ‚π’›π‘–=βˆ’βˆ‘π‘—=1π‘π’•π‘—βˆ‚(log𝒑𝑗)βˆ‚π’›π‘–=βˆ’βˆ‘π‘—=1𝑁𝒕𝑗1π’‘π‘—βˆ‚π’‘π‘—βˆ‚π’›π‘–=βˆ’π’•π‘–π’‘π‘–βˆ‚π’‘π‘–βˆ‚π’›π‘–βˆ’βˆ‘π‘—=1,π‘—β‰ π‘–π‘π’•π‘—π’‘π‘—βˆ‚π’‘π‘—βˆ‚π’›π‘–=βˆ’π’•π‘–π’‘π‘–π’‘π‘–(1βˆ’π’‘π‘–)βˆ’βˆ‘π‘—=1,𝑗≠𝑖𝑁𝒕𝑗𝒑𝑗(βˆ’π’‘π‘—π’‘π‘–)=βˆ’π’•π‘–+𝒕𝑖𝒑𝑖+βˆ‘π‘—=1,𝑗≠𝑖𝑁𝒕𝑗𝒑𝑖=βˆ’π’•π‘–+βˆ‘π‘—=1𝑁𝒕𝑗𝒑𝑖=βˆ’π’•π‘–+π’‘π‘–βˆ‘π‘—=1𝑁𝒕𝑗=βˆ’π’•π‘–+𝒑𝑖=π’‘π‘–βˆ’π’•π‘–

So,

βˆ‚πΏβˆ‚π’›=π’‘βˆ’π’•

2.3. Gradient in Matrix Form

For batch computations, it’s efficient to represent gradients in matrix form.

Given a batch of logits $\boldsymbol{Z} \in \mathbb{R}^{n \times d}$, the row-wise softmax probabilities $\boldsymbol{P} \in \mathbb{R}^{n \times d}$, and the one-hot label matrix $\boldsymbol{Y} \in \{0, 1\}^{n \times d}$ (row $i$ has a $1$ in column $y_i$):

Each row of $\boldsymbol{P}$ is the softmax of the corresponding row of $\boldsymbol{Z}$, so the row-wise Jacobian is

$$
\frac{\partial \boldsymbol{P}_{i,j}}{\partial \boldsymbol{Z}_{i,k}} = \boldsymbol{P}_{i,j}\left(\delta_{j,k} - \boldsymbol{P}_{i,k}\right)
$$

and therefore, for the sum of the per-instance losses, the gradient with respect to the logits is

$$
\frac{\partial L}{\partial \boldsymbol{Z}} = \boldsymbol{P} - \boldsymbol{Y}
$$

Normalized by batch size, the overall gradient of the loss is:

βˆ‚πΏβˆ‚π’=1𝑛(π‘·βˆ’π’€)

3. Linear-Softmax-Cross-Entropy

In a typical classification or language-modeling head, the cross-entropy loss is preceded by a linear (fully connected) layer followed by a softmax activation. If we fuse the linear layer with the softmax and cross-entropy computation, we may be able to avoid the full materialization of the logit matrix.

3.1. Forward Pass

In the linear transformation, the input $\boldsymbol{X} \in \mathbb{R}^{n \times d_{\text{in}}}$ is transformed using the weight matrix $\boldsymbol{W} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ and the bias $\boldsymbol{b} \in \mathbb{R}^{d_{\text{out}}}$:

$$
\boldsymbol{Z} = \boldsymbol{X} \boldsymbol{W} + \boldsymbol{b}
$$

Softmax:

$$
\boldsymbol{P}_{i,j} = \frac{e^{\boldsymbol{Z}_{i,j}}}{\sum_{k=1}^{d_{\text{out}}} e^{\boldsymbol{Z}_{i,k}}}
$$

Cross-entropy loss is computed for each instance and then averaged over the batch:

$$
L_i = -\log\left(\boldsymbol{P}_{i, y_i}\right), \qquad L = \frac{1}{n} \sum_{i=1}^{n} L_i
$$
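
Putting the three steps together, here is a small sketch of the forward pass; the helper name `linear_softmax_ce_forward` and the toy dimensions are made up for illustration, and the log-probabilities are computed via `logsumexp` for numerical stability instead of materializing $\boldsymbol{P}$ explicitly:

```python
import torch
import torch.nn.functional as F

def linear_softmax_ce_forward(X, W, b, y):
    """X: [n, d_in], W: [d_in, d_out], b: [d_out], y: [n] integer labels."""
    Z = X @ W + b                                        # logits, [n, d_out]
    logP = Z - torch.logsumexp(Z, dim=1, keepdim=True)   # stable log-softmax
    L_i = -logP[torch.arange(X.shape[0]), y]             # per-instance losses
    return L_i.mean(), Z

n, d_in, d_out = 4, 8, 10                                # arbitrary toy sizes
X, W, b = torch.randn(n, d_in), torch.randn(d_in, d_out), torch.randn(d_out)
y = torch.randint(0, d_out, (n,))

L, Z = linear_softmax_ce_forward(X, W, b, y)
print(L, F.cross_entropy(Z, y))                          # should match
```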

3.2. Backward Pass

Gradient of $\boldsymbol{Z}$:

$$
\frac{\partial L}{\partial \boldsymbol{Z}} = \frac{1}{n}(\boldsymbol{P} - \boldsymbol{Y})
$$

Gradient of $\boldsymbol{W}$:

$$
\frac{\partial L}{\partial \boldsymbol{W}} = \boldsymbol{X}^T \frac{\partial L}{\partial \boldsymbol{Z}}
$$

Gradient of $\boldsymbol{b}$ (summing over the batch dimension):

$$
\frac{\partial L}{\partial \boldsymbol{b}} = \sum_{i=1}^{n} \frac{\partial L}{\partial \boldsymbol{Z}_i}
$$

Gradient of the input $\boldsymbol{X}$:

$$
\frac{\partial L}{\partial \boldsymbol{X}} = \frac{\partial L}{\partial \boldsymbol{Z}} \boldsymbol{W}^T
$$

3.3. Summary of Gradients

| Quantity | Formula | Dimensions |
|----------|---------|------------|
| $\boldsymbol{Z}$ | $\boldsymbol{Z} = \boldsymbol{X}\boldsymbol{W} + \boldsymbol{b}$ | $[n, d_{\text{out}}]$ |
| $\boldsymbol{P}$ | $\boldsymbol{P} = \mathrm{softmax}(\boldsymbol{Z})$ | $[n, d_{\text{out}}]$ |
| $L$ | $L = -\frac{1}{n}\sum_{i=1}^{n} \log(\boldsymbol{P}_{i, y_i})$ | scalar |
| $d\boldsymbol{Z}$ | $d\boldsymbol{Z} = \frac{1}{n}(\boldsymbol{P} - \boldsymbol{Y})$ | $[n, d_{\text{out}}]$ |
| $d\boldsymbol{W}$ | $d\boldsymbol{W} = \boldsymbol{X}^T\, d\boldsymbol{Z}$ | $[d_{\text{in}}, d_{\text{out}}]$ |
| $d\boldsymbol{b}$ | $d\boldsymbol{b} = \mathrm{sum}(d\boldsymbol{Z}, \text{axis}=0)$ | $[d_{\text{out}}]$ |
| $d\boldsymbol{X}$ | $d\boldsymbol{X} = d\boldsymbol{Z}\, \boldsymbol{W}^T$ | $[n, d_{\text{in}}]$ |

4. Optimization Strategies

4.1. efficient_cross_entropy

4.2. Liger Kernel

4.3. Cut Your Losses in Large-Vocabulary Language Models