1. Notation
2. Motivations: We Pursue Relative Positional Embedding
3. RoPE (Rotary Position Embedding)
4. RoPE Implementation (Half-and-Half Pairing)
5. Example
1. Notation
| Notation | Description |
| --- | --- |
| $q_m$ | $m$-th query vector without positional information |
| $k_n$ | $n$-th key vector without positional information |
| $\tilde{q}_m = f_q(q_m, m)$ | $m$-th query vector with positional information |
| $\tilde{k}_n = f_k(k_n, n)$ | $n$-th key vector with positional information |
2. Motivations: We Pursue Relative Positional Embedding
To incorporate positional information into the attention mechanism, we need to transform the original query and key vectors. The functions $f_q$ and $f_k$ encode the position indices $m$ and $n$ into the query and key vectors respectively, resulting in the position-aware representations $\tilde{q}_m = f_q(q_m, m)$ and $\tilde{k}_n = f_k(k_n, n)$.

Typically, the attention score between a query at position $m$ and a key at position $n$ can be represented as a function that depends on both the content vectors ($q_m$ and $k_n$) and their absolute positions ($m$ and $n$) as follows:

$$\langle f_q(q_m, m),\, f_k(k_n, n) \rangle = g(q_m, k_n, m, n),$$

but we want the attention score to depend only on the relative position $(m - n)$ rather than the absolute positions $m$ and $n$, since relative position generalizes more easily to unseen sequence lengths.

Goal of relative positional embedding: Thus our goal is to find a function $g$ which is a function of $q_m$, $k_n$, and $m - n$ only, instead of $m$ and $n$ themselves, as follows:

$$\langle f_q(q_m, m),\, f_k(k_n, n) \rangle = g(q_m, k_n, m - n).$$

RoPE is a solution to this goal.
3. RoPE (Rotary Position Embedding)
RoPE is a positional embedding that rotates the query and key vectors by position-dependent angles, so that the resulting attention score is a function of the relative position $m - n$.

The basic idea is that the dot product of two rotated vectors depends only on the difference of their rotation angles. RoPE exploits this property of rotation matrices. To rotate a 2-dimensional vector $x$ by an angle $\theta$, we multiply it by the rotation matrix

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

to get the rotated vector $R(\theta)\,x$.
In the following, suppose the query and key vectors are 2-dimensional. We rotate the query by $m\theta$ and the key by $n\theta$:

$$\tilde{q}_m = R(m\theta)\, q_m, \qquad \tilde{k}_n = R(n\theta)\, k_n.$$

The attention score is then:

$$\langle \tilde{q}_m, \tilde{k}_n \rangle = q_m^\top R(m\theta)^\top R(n\theta)\, k_n = q_m^\top R\big((n - m)\theta\big)\, k_n,$$

using the identity $R(\theta_1)^\top R(\theta_2) = R(\theta_2 - \theta_1)$. Thus we have shown that the attention score is a function of $q_m$, $k_n$, and their relative position $m - n$, not of the absolute positions $m$ and $n$ themselves.
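To make this concrete, here is a minimal NumPy check (with arbitrary example values for $q$, $k$, and $\theta$) showing that the score depends only on $m - n$:

```python
import numpy as np

def rot(theta):
    """2D rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

q = np.array([1.0, 2.0])   # query content, no positional information
k = np.array([0.5, -1.0])  # key content, no positional information
theta = 0.3                # per-pair rotation angle (arbitrary here)

# Every (m, n) pair below has m - n = 3, so all scores are identical.
for m, n in [(5, 2), (9, 6), (103, 100)]:
    score = (rot(m * theta) @ q) @ (rot(n * theta) @ k)
    print(f"m={m}, n={n}: score={score:.6f}")
```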
4. RoPE Implementation (Half-and-Half Pairing)
This pairing is used in many popular implementations, such as LLaMA, because it maps efficiently onto vectorized operations. Instead of pairing adjacent dimensions (which is more intuitive), it pairs the first half of the dimensions with the second half.
For a vector with dimension $d$:
- Pair 0: dimension $0$ is paired with dimension $d/2$.
- Pair 1: dimension $1$ is paired with dimension $d/2 + 1$.
- …
- Pair $d/2 - 1$: dimension $d/2 - 1$ is paired with dimension $d - 1$.
In practice, the query and key vectors are not 2-dimensional but $d$-dimensional ($d$ even). We then group the dimensions into $d/2$ pairs and rotate each pair by a different angle. With half-and-half pairing, the pairs are $(q_0, q_{d/2}), (q_1, q_{d/2+1}), \dots, (q_{d/2-1}, q_{d-1})$.

For each pair $i$ ($i = 0, 1, \dots, d/2 - 1$), we rotate by an angle $m\theta_i$ with

$$\theta_i = b^{-2i/d},$$

where $b$ is a base frequency (typically $b = 10000$), $i$ is the pair index, and $d$ is the dimension of the query and key vectors.
This means higher-indexed pairs are rotated by smaller angles (lower frequencies), creating a spectrum of different frequencies in the positional encoding.
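As a quick illustration (assuming $d = 8$ and base $b = 10000$), the per-pair frequencies can be computed in a couple of lines:

```python
import numpy as np

d, base = 8, 10000.0
i = np.arange(d // 2)               # pair indices 0, 1, ..., d/2 - 1
theta = base ** (-2.0 * i / d)      # theta_i = base^(-2i/d)
print(theta)                        # [1.0, 0.1, 0.01, 0.001]
```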
The rotated query vector becomes:

$$\tilde{q}_m = \begin{pmatrix} q_0 \cos m\theta_0 - q_{d/2} \sin m\theta_0 \\ \vdots \\ q_{d/2-1} \cos m\theta_{d/2-1} - q_{d-1} \sin m\theta_{d/2-1} \\ q_{d/2} \cos m\theta_0 + q_0 \sin m\theta_0 \\ \vdots \\ q_{d-1} \cos m\theta_{d/2-1} + q_{d/2-1} \sin m\theta_{d/2-1} \end{pmatrix}.$$

Similarly for the key vector, with $m$ replaced by $n$:

$$\tilde{k}_n = \begin{pmatrix} k_0 \cos n\theta_0 - k_{d/2} \sin n\theta_0 \\ \vdots \\ k_{d-1} \cos n\theta_{d/2-1} + k_{d/2-1} \sin n\theta_{d/2-1} \end{pmatrix}.$$
When we compute the attention score between these rotated vectors, each pair contributes a term that depends on the relative position $m - n$, just as in the 2D case we analyzed earlier. The different frequencies allow the model to capture position-dependent patterns at different scales.

Thus we have:

$$\langle \tilde{q}_m, \tilde{k}_n \rangle = \sum_{i=0}^{d/2-1} \Big[ \big(q_i k_i + q_{i+d/2}\, k_{i+d/2}\big) \cos\big((m - n)\theta_i\big) + \big(q_i\, k_{i+d/2} - q_{i+d/2}\, k_i\big) \sin\big((m - n)\theta_i\big) \Big],$$

which is a function of $q$, $k$, and $m - n$ only.
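The following is a compact NumPy sketch of the half-and-half scheme; `rotate_half` and `apply_rope` are illustrative names rather than the API of any particular library, and real implementations (e.g. LLaMA's) operate on batched tensors rather than single vectors:

```python
import numpy as np

def rotate_half(x):
    """(x_0, ..., x_{d-1}) -> (-x_{d/2}, ..., -x_{d-1}, x_0, ..., x_{d/2-1})."""
    d = x.shape[-1]
    return np.concatenate([-x[d // 2:], x[:d // 2]])

def apply_rope(x, m, base=10000.0):
    """Rotate a d-dimensional query or key vector x for position m."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # (d/2,) frequencies
    angles = np.concatenate([m * theta, m * theta])  # duplicated to length d
    return x * np.cos(angles) + rotate_half(x) * np.sin(angles)

# The score between rotated vectors depends only on m - n:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
print(apply_rope(q, 7) @ apply_rope(k, 4))    # m - n = 3
print(apply_rope(q, 10) @ apply_rope(k, 7))   # m - n = 3, same score
```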
5. Example
Let's walk through how a typical implementation achieves this with a concrete example where dim $d = 8$. The input query vector is $q = (q_0, q_1, \dots, q_7)$.

Pairing: The pairs will be: $(q_0, q_4)$, $(q_1, q_5)$, $(q_2, q_6)$, $(q_3, q_7)$.

Frequency: There will be $4$ unique angles for a given position $m$: $m\theta_0, m\theta_1, m\theta_2, m\theta_3$. These angles are calculated from the position $m$ and the pair index $i$:

$$\theta_i = 10000^{-2i/d},$$

where $d = 8$ is the dimension. For example, $\theta_0 = 10000^{0} = 1$, $\theta_1 = 10000^{-2/8} = 0.1$, $\theta_2 = 10000^{-4/8} = 0.01$, and $\theta_3 = 10000^{-6/8} = 0.001$.

We construct the angles for a given position $m$ as $(m\theta_0, m\theta_1, m\theta_2, m\theta_3, m\theta_0, m\theta_1, m\theta_2, m\theta_3)$.
The final $\cos$ and $\sin$ tensors will therefore have this duplicated structure. For position $m$:

$$\cos = (\cos m\theta_0, \cos m\theta_1, \cos m\theta_2, \cos m\theta_3, \cos m\theta_0, \cos m\theta_1, \cos m\theta_2, \cos m\theta_3),$$
$$\sin = (\sin m\theta_0, \sin m\theta_1, \sin m\theta_2, \sin m\theta_3, \sin m\theta_0, \sin m\theta_1, \sin m\theta_2, \sin m\theta_3).$$
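A small sketch of how these duplicated tensors can be built (assuming $d = 8$, base $10000$, and, say, position $m = 2$):

```python
import numpy as np

d, base, m = 8, 10000.0, 2                       # position m = 2 as an example
theta = base ** (-2.0 * np.arange(d // 2) / d)   # [1.0, 0.1, 0.01, 0.001]
angles = np.concatenate([m * theta, m * theta])  # duplicated structure
cos, sin = np.cos(angles), np.sin(angles)
print(angles)   # [2.0, 0.2, 0.02, 0.002, 2.0, 0.2, 0.02, 0.002]
```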
Now, let's see how the rotation is applied to $q$.

We construct the half-rotated query vector as $\mathrm{rotate\_half}(q) = (-q_4, -q_5, -q_6, -q_7, q_0, q_1, q_2, q_3)$.

We then construct the rotated query vector as:

$$\tilde{q}_m = q \odot \cos + \mathrm{rotate\_half}(q) \odot \sin = \begin{pmatrix} q_0 \cos m\theta_0 - q_4 \sin m\theta_0 \\ q_1 \cos m\theta_1 - q_5 \sin m\theta_1 \\ q_2 \cos m\theta_2 - q_6 \sin m\theta_2 \\ q_3 \cos m\theta_3 - q_7 \sin m\theta_3 \\ q_4 \cos m\theta_0 + q_0 \sin m\theta_0 \\ q_5 \cos m\theta_1 + q_1 \sin m\theta_1 \\ q_6 \cos m\theta_2 + q_2 \sin m\theta_2 \\ q_7 \cos m\theta_3 + q_3 \sin m\theta_3 \end{pmatrix}.$$
If we let our 2D vector be $(x, y) = (q_0, q_4)$ and the rotation angle be $\theta = m\theta_0$, the standard 2D rotation formulas are:

$$x' = x \cos\theta - y \sin\theta, \qquad y' = x \sin\theta + y \cos\theta.$$

As you can see, this construction exactly implements the 2D rotation for the pair $(q_0, q_4)$. The same logic applies simultaneously to all the other pairs: $(q_1, q_5)$, $(q_2, q_6)$, $(q_3, q_7)$.
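To close the loop, here is a small NumPy check (with $q = (1, 2, \dots, 8)$ and position $m = 3$ chosen arbitrarily) that the vectorized half-and-half computation matches an explicit 2D rotation of each pair:

```python
import numpy as np

d, base, m = 8, 10000.0, 3
q = np.arange(1.0, d + 1)                        # q = (1, 2, ..., 8)
theta = base ** (-2.0 * np.arange(d // 2) / d)   # per-pair frequencies

# Vectorized half-and-half rotation: q * cos + rotate_half(q) * sin
angles = np.concatenate([m * theta, m * theta])
rot_half = np.concatenate([-q[d // 2:], q[:d // 2]])
q_fast = q * np.cos(angles) + rot_half * np.sin(angles)

# Reference: explicit 2D rotation of each pair (q_i, q_{i+d/2}) by m * theta_i
q_ref = np.empty(d)
for i in range(d // 2):
    c, s = np.cos(m * theta[i]), np.sin(m * theta[i])
    q_ref[i] = c * q[i] - s * q[i + d // 2]
    q_ref[i + d // 2] = s * q[i] + c * q[i + d // 2]

print(np.allclose(q_fast, q_ref))   # True
```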