
RoPE: Rotary Position Embedding

1. Notation
2. Motivations: We Pursue Relative Positional Embedding
3. RoPE (Rotary Position Embedding)
4. RoPE Implementation (Half-and-Half Pairing)
5. Example

1. Notation

| Notation | Description |
| --- | --- |
| $\boldsymbol{q}_m$ | $m$-th query vector without positional information |
| $\boldsymbol{k}_n$ | $n$-th key vector without positional information |
| $\boldsymbol{q}'_m$ | $m$-th query vector with positional information |
| $\boldsymbol{k}'_n$ | $n$-th key vector with positional information |

2. Motivations: We Pursue Relative Positional Embedding

To incorporate positional information into the attention mechanism, we need to transform the original query and key vectors. The function $f(\cdot,\cdot)$ encodes the position indices $m$ and $n$ into the query and key vectors respectively, resulting in position-aware representations $\boldsymbol{q}'_m$ and $\boldsymbol{k}'_n$:

$$\boldsymbol{q}'_m = f(\boldsymbol{q}_m, m), \qquad \boldsymbol{k}'_n = f(\boldsymbol{k}_n, n)$$

Typically, the attention score between a query at position $m$ and a key at position $n$ can be represented as a function $g$ that depends on both the content vectors ($\boldsymbol{q}_m$ and $\boldsymbol{k}_n$) and their absolute positions ($m$ and $n$) as follows:

$$\mathrm{Attn}(\boldsymbol{q}'_m, \boldsymbol{k}'_n) = g(\boldsymbol{q}_m, \boldsymbol{k}_n, m, n)$$

but we want the attention score to depend only on the relative position $(m-n)$ rather than on the absolute positions $m$ and $n$, since relative positions generalize more easily to unseen sequence lengths.

Goal of relative positional embedding: thus our goal is to find a function $g$ that depends only on $\boldsymbol{q}_m$, $\boldsymbol{k}_n$, and $m-n$, instead of on $m$ and $n$ themselves:

$$\mathrm{Attn}(\boldsymbol{q}'_m, \boldsymbol{k}'_n) = g(\boldsymbol{q}_m, \boldsymbol{k}_n, m-n)$$

RoPE is one such solution.

3. RoPE (Rotary Position Embedding)

RoPE is a positional embedding that rotates the query and key vectors by position-dependent angles, so that the resulting attention score is a function of the relative position $m-n$.

The basic idea of RoPE is to rotate the query and key vectors by certain angles so that their dot product depends only on the relative position $m-n$. RoPE exploits a property of rotation matrices: to rotate a 2-dimensional vector $\begin{pmatrix} x \\ y \end{pmatrix}$ by an angle $\theta$, we multiply it by the rotation matrix $\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$, obtaining the rotated vector $\begin{pmatrix} x\cos\theta - y\sin\theta \\ x\sin\theta + y\cos\theta \end{pmatrix}$.

In the following, suppose the query and key vectors are 2-dimensional. We rotate the query by the angle $m\theta$ and the key by the angle $n\theta$:

$$\boldsymbol{q}'_m = f(\boldsymbol{q}_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}\begin{pmatrix} q_{m1} \\ q_{m2} \end{pmatrix}, \qquad \boldsymbol{k}'_n = f(\boldsymbol{k}_n, n) = \begin{pmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{pmatrix}\begin{pmatrix} k_{n1} \\ k_{n2} \end{pmatrix}$$

The attention score is then:

$$\begin{aligned}
\mathrm{Attn}(\boldsymbol{q}'_m, \boldsymbol{k}'_n) &= \boldsymbol{q}'^{\,T}_m \boldsymbol{k}'_n \\
&= \left[\begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}\begin{pmatrix} q_{m1} \\ q_{m2} \end{pmatrix}\right]^{T}\left[\begin{pmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{pmatrix}\begin{pmatrix} k_{n1} \\ k_{n2} \end{pmatrix}\right] \\
&= \begin{pmatrix} q_{m1} & q_{m2} \end{pmatrix}\left[\begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}^{T}\begin{pmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{pmatrix}\right]\begin{pmatrix} k_{n1} \\ k_{n2} \end{pmatrix} \\
&= \begin{pmatrix} q_{m1} & q_{m2} \end{pmatrix}\left[\begin{pmatrix} \cos m\theta & \sin m\theta \\ -\sin m\theta & \cos m\theta \end{pmatrix}\begin{pmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{pmatrix}\right]\begin{pmatrix} k_{n1} \\ k_{n2} \end{pmatrix} \\
&= \begin{pmatrix} q_{m1} & q_{m2} \end{pmatrix}\begin{pmatrix} \cos\big((n-m)\theta\big) & -\sin\big((n-m)\theta\big) \\ \sin\big((n-m)\theta\big) & \cos\big((n-m)\theta\big) \end{pmatrix}\begin{pmatrix} k_{n1} \\ k_{n2} \end{pmatrix} \\
&= q_{m1}\big[k_{n1}\cos((n-m)\theta) - k_{n2}\sin((n-m)\theta)\big] + q_{m2}\big[k_{n1}\sin((n-m)\theta) + k_{n2}\cos((n-m)\theta)\big] \\
&= (q_{m1}k_{n1} + q_{m2}k_{n2})\cos((n-m)\theta) + (q_{m2}k_{n1} - q_{m1}k_{n2})\sin((n-m)\theta) \\
&= g(\boldsymbol{q}_m, \boldsymbol{k}_n, m-n)
\end{aligned}$$

Thus we have shown that the attention score is a function of $\boldsymbol{q}_m$, $\boldsymbol{k}_n$, and their relative position $m-n$, not the absolute positions $m$ and $n$ themselves.
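A quick way to see this numerically: the following is a minimal NumPy sketch (the helper name `rotate_2d` is just for this illustration) that checks that two position pairs with the same offset $m-n$ yield the same score.

```python
import numpy as np

def rotate_2d(vec, angle):
    """Rotate a 2-D vector by `angle` using the standard rotation matrix."""
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta = 0.3  # an arbitrary base angle

# (m, n) = (5, 2) and (9, 6) share the same relative position m - n = 3,
# so the rotated dot products should be identical.
score_1 = rotate_2d(q, 5 * theta) @ rotate_2d(k, 2 * theta)
score_2 = rotate_2d(q, 9 * theta) @ rotate_2d(k, 6 * theta)
print(np.isclose(score_1, score_2))  # True
```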

4. RoPE Implementation (Half-and-Half Pairing)

This is the pairing used in many popular implementations, such as LLaMA. It is chosen for its efficiency with vectorized operations: instead of pairing adjacent dimensions (which is more intuitive), it pairs the first half of the dimensions with the second half (a short index sketch follows the list below).

For a vector $\boldsymbol{q}$ with dimension $d$:

  • Pair 0: dimension 0 is paired with dimension $d/2$: $(q_0, q_{d/2})$
  • Pair 1: dimension 1 is paired with dimension $d/2+1$: $(q_1, q_{d/2+1})$
  • …
  • Pair $d/2-1$: dimension $d/2-1$ is paired with dimension $d-1$: $(q_{d/2-1}, q_{d-1})$
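A minimal sketch of these pair indices, assuming a head dimension of $d = 8$:

```python
d = 8  # head dimension (assumed even)

# Half-and-half pairing: dimension i is paired with dimension i + d/2.
pairs = [(i, i + d // 2) for i in range(d // 2)]
print(pairs)  # [(0, 4), (1, 5), (2, 6), (3, 7)]
```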

In practice, the query and key vectors are not 2-dimensional but $d$-dimensional, where $d$ is even.

$$\boldsymbol{q}_m = \begin{pmatrix} q_{m,0} \\ q_{m,1} \\ \vdots \\ q_{m,d-1} \end{pmatrix}, \qquad \boldsymbol{k}_n = \begin{pmatrix} k_{n,0} \\ k_{n,1} \\ \vdots \\ k_{n,d-1} \end{pmatrix}$$

We then group the query and key vectors into $d/2$ pairs and rotate each pair by a different angle $\theta_i$ ($i = 0, 1, \ldots, d/2-1$). With half-and-half pairing, the pairs are:

$$\left[\begin{pmatrix} q_{m,0} \\ q_{m,d/2} \end{pmatrix}, \begin{pmatrix} k_{n,0} \\ k_{n,d/2} \end{pmatrix}\right], \left[\begin{pmatrix} q_{m,1} \\ q_{m,d/2+1} \end{pmatrix}, \begin{pmatrix} k_{n,1} \\ k_{n,d/2+1} \end{pmatrix}\right], \ldots, \left[\begin{pmatrix} q_{m,d/2-1} \\ q_{m,d-1} \end{pmatrix}, \begin{pmatrix} k_{n,d/2-1} \\ k_{n,d-1} \end{pmatrix}\right]$$

For each pair $i$ ($i = 0, 1, \ldots, d/2-1$), we rotate by an angle $\theta_i = \frac{1}{\omega^{2i/d}} = \frac{1}{10000^{2i/d}}$, where $\omega$ is the base (typically 10000), $i$ is the pair index, and $d$ is the dimension of the query and key vectors.

This means pairs with a higher index $i$ are rotated by smaller angles (lower frequencies), creating a spectrum of frequencies in the positional encoding.
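A minimal NumPy sketch of these angles for $d = 8$ (the variable names are illustrative):

```python
import numpy as np

d = 8           # head dimension (assumed even)
base = 10000.0  # the base omega

# theta_i = 1 / base^(2i/d) for i = 0, ..., d/2 - 1
i = np.arange(d // 2)
theta = 1.0 / base ** (2 * i / d)
print(theta)  # [1.e+00 1.e-01 1.e-02 1.e-03] -- the angles shrink as i grows
```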

The rotated query vector becomes:

$$\boldsymbol{q}'_m = \begin{pmatrix} q_{m,0}\cos(m\theta_0) - q_{m,d/2}\sin(m\theta_0) \\ q_{m,1}\cos(m\theta_1) - q_{m,d/2+1}\sin(m\theta_1) \\ \vdots \\ q_{m,d/2-1}\cos(m\theta_{d/2-1}) - q_{m,d-1}\sin(m\theta_{d/2-1}) \\ q_{m,0}\sin(m\theta_0) + q_{m,d/2}\cos(m\theta_0) \\ q_{m,1}\sin(m\theta_1) + q_{m,d/2+1}\cos(m\theta_1) \\ \vdots \\ q_{m,d/2-1}\sin(m\theta_{d/2-1}) + q_{m,d-1}\cos(m\theta_{d/2-1}) \end{pmatrix}$$

Similarly for the key vector:

๐’Œ๐‘›โ€ฒ=(๐‘˜๐‘›1cos(๐œƒ0๐‘›)โˆ’๐‘˜๐‘›๐‘‘2+1sin(๐œƒ0๐‘›)๐‘˜๐‘›2cos(๐œƒ2๐‘‘๐‘›)โˆ’๐‘˜๐‘›๐‘‘2+2sin(๐œƒ2๐‘‘๐‘›)โ‹ฎ๐‘˜๐‘›๐‘‘2cos(๐œƒ๐‘‘โˆ’2๐‘‘๐‘›)โˆ’๐‘˜๐‘›๐‘‘sin(๐œƒ๐‘‘โˆ’2๐‘‘๐‘›)๐‘˜๐‘›1sin(๐œƒ0๐‘›)+๐‘˜๐‘›๐‘‘2+1cos(๐œƒ0๐‘›)๐‘˜๐‘›2sin(๐œƒ2๐‘‘๐‘›)+๐‘˜๐‘›๐‘‘2+2cos(๐œƒ2๐‘‘๐‘›)โ‹ฎ๐‘˜๐‘›๐‘‘2sin(๐œƒ๐‘‘โˆ’2๐‘‘๐‘›)+๐‘˜๐‘›๐‘‘cos(๐œƒ๐‘‘โˆ’2๐‘‘๐‘›))=(๐‘˜๐‘›1cos(๐œƒ0๐‘›)โˆ’๐‘˜๐‘›๐‘‘2+1sin(๐œƒ0๐‘›)๐‘˜๐‘›2cos(๐œƒ2๐‘‘๐‘›)โˆ’๐‘˜๐‘›๐‘‘2+2sin(๐œƒ2๐‘‘๐‘›)โ‹ฎ๐‘˜๐‘›๐‘‘2cos(๐œƒ๐‘‘โˆ’2๐‘‘๐‘›)โˆ’๐‘˜๐‘›๐‘‘sin(๐œƒ๐‘‘โˆ’2๐‘‘๐‘›)๐‘˜๐‘›๐‘‘2+1cos(๐œƒ0๐‘›)+๐‘˜๐‘›1sin(๐œƒ0๐‘›)๐‘˜๐‘›๐‘‘2+2cos(๐œƒ2๐‘‘๐‘›)+๐‘˜๐‘›2sin(๐œƒ2๐‘‘๐‘›)โ‹ฎ๐‘˜๐‘›๐‘‘cos(๐œƒ๐‘‘โˆ’2๐‘‘๐‘›)+๐‘˜๐‘›๐‘‘2sin(๐œƒ๐‘‘โˆ’2๐‘‘๐‘›))

When we compute the attention score between these rotated vectors, each pair contributes a term that depends on the relative position $(m-n)$, just as in the 2D case analyzed earlier. The different frequencies $\theta_i = 1/\omega^{2i/d}$ allow the model to capture position-dependent patterns at different scales.

Thus we have:

$$\begin{aligned}
\boldsymbol{q}'^{\,T}_m \boldsymbol{k}'_n &= \Big[(q_{m,0}k_{n,0} + q_{m,d/2}k_{n,d/2})\cos\big((n-m)\theta_0\big) + (q_{m,d/2}k_{n,0} - q_{m,0}k_{n,d/2})\sin\big((n-m)\theta_0\big)\Big] \\
&\quad + \cdots \\
&\quad + \Big[(q_{m,d/2-1}k_{n,d/2-1} + q_{m,d-1}k_{n,d-1})\cos\big((n-m)\theta_{d/2-1}\big) + (q_{m,d-1}k_{n,d/2-1} - q_{m,d/2-1}k_{n,d-1})\sin\big((n-m)\theta_{d/2-1}\big)\Big] \\
&= g(\boldsymbol{q}_m, \boldsymbol{k}_n, m-n)
\end{aligned}$$
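As a sanity check, here is a minimal NumPy sketch (the helper `rope_per_pair` is written just for this check) that rotates each pair explicitly and confirms that the score depends only on $m - n$:

```python
import numpy as np

def rope_per_pair(x, pos, theta):
    """Rotate each pair (x[i], x[i + d/2]) by the angle pos * theta[i]."""
    d = x.shape[0]
    out = x.copy()
    for i in range(d // 2):
        c, s = np.cos(pos * theta[i]), np.sin(pos * theta[i])
        a, b = x[i], x[i + d // 2]
        out[i] = a * c - b * s
        out[i + d // 2] = a * s + b * c
    return out

d = 8
theta = 1.0 / 10000.0 ** (2 * np.arange(d // 2) / d)
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# (m, n) = (7, 3) and (12, 8) share the same relative position m - n = 4.
score_1 = rope_per_pair(q, 7, theta) @ rope_per_pair(k, 3, theta)
score_2 = rope_per_pair(q, 12, theta) @ rope_per_pair(k, 8, theta)
print(np.isclose(score_1, score_2))  # True
```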

5. Example

Let's walk through how an implementation achieves this with a concrete example where $d = 8$. The input query vector is $\boldsymbol{q} = [q_0, q_1, q_2, q_3, q_4, q_5, q_6, q_7]$.

Pairing: the pairs will be $(q_0, q_4), (q_1, q_5), (q_2, q_6), (q_3, q_7)$.

Frequencies: there are $d/2 = 4$ unique angles for a given position $m$: $\theta_{m,0}, \theta_{m,1}, \theta_{m,2}, \theta_{m,3}$. These angles are computed from the position $m$ and the pair index $i \in \{0, 1, 2, 3\}$:

๐œƒ๐‘š๐‘–=1100002๐‘–๐‘‘โ‹…๐‘š

where $d$ is the dimension. For example, $\theta_{m,0} = \frac{1}{10000^{2\cdot 0/8}} \cdot m = \frac{1}{10000^{0}} \cdot m = 1 \cdot m = m$.

We construct the angles for a given position $m$ as $[\theta_{m,0}, \theta_{m,1}, \theta_{m,2}, \theta_{m,3}, \theta_{m,0}, \theta_{m,1}, \theta_{m,2}, \theta_{m,3}]$.

The final cos and sin tensors will therefore have this duplicated structure. For position m:

$$\cos_m = [\cos(\theta_{m,0}), \cos(\theta_{m,1}), \cos(\theta_{m,2}), \cos(\theta_{m,3}), \cos(\theta_{m,0}), \cos(\theta_{m,1}), \cos(\theta_{m,2}), \cos(\theta_{m,3})]$$

$$\sin_m = [\sin(\theta_{m,0}), \sin(\theta_{m,1}), \sin(\theta_{m,2}), \sin(\theta_{m,3}), \sin(\theta_{m,0}), \sin(\theta_{m,1}), \sin(\theta_{m,2}), \sin(\theta_{m,3})]$$
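A minimal NumPy sketch of how these duplicated vectors might be built (the head dimension $d = 8$ and position $m = 3$ are arbitrary choices for illustration):

```python
import numpy as np

d, m = 8, 3  # head dimension and an example position

theta = 1.0 / 10000.0 ** (2 * np.arange(d // 2) / d)  # [theta_0, ..., theta_3]
angles = np.concatenate([m * theta, m * theta])       # duplicated structure
cos_m, sin_m = np.cos(angles), np.sin(angles)
print(cos_m.round(3))
print(sin_m.round(3))
```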

Now, let's see how the rotation is applied to $\boldsymbol{q} = [q_0, q_1, q_2, q_3, q_4, q_5, q_6, q_7]$.

We construct the half-rotated query vector as $\boldsymbol{q}_{\text{rotated}} = [-q_4, -q_5, -q_6, -q_7, q_0, q_1, q_2, q_3]$.

We then construct the rotated query vector $\boldsymbol{q}'_m$ as:

$$\boldsymbol{q}'_m = (\boldsymbol{q} \odot \cos_m) + (\boldsymbol{q}_{\text{rotated}} \odot \sin_m)$$
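Putting the pieces together, here is a minimal NumPy sketch of this formula. The names `rotate_half` and `apply_rope` are illustrative for this write-up, not the API of any particular library, although LLaMA-style implementations follow the same elementwise recipe on batched tensors.

```python
import numpy as np

def rotate_half(x):
    """Return [-x2, x1], where x1 is the first half of x and x2 the second half."""
    x1, x2 = np.split(x, 2)
    return np.concatenate([-x2, x1])

def apply_rope(x, pos, base=10000.0):
    """Apply RoPE with half-and-half pairing: x * cos + rotate_half(x) * sin."""
    d = x.shape[0]
    theta = 1.0 / base ** (2 * np.arange(d // 2) / d)
    angles = np.concatenate([pos * theta, pos * theta])
    return x * np.cos(angles) + rotate_half(x) * np.sin(angles)
```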

If we let our 2D vector be $(q_0, q_4)$ and the rotation angle be $\theta_{m,0}$, the standard 2D rotation formulas are:

  • $q'_0 = q_0\cos(\theta_{m,0}) - q_4\sin(\theta_{m,0})$
  • $q'_4 = q_0\sin(\theta_{m,0}) + q_4\cos(\theta_{m,0})$

As you can see, the elementwise formula implements exactly the 2D rotation for the pair $(q_0, q_4)$. The same logic applies simultaneously to all other pairs, $(q_1, q_5)$, $(q_2, q_6)$, $(q_3, q_7)$; a numeric check of the full example follows the list below.

  • $q'_1 = q_1\cos(\theta_{m,1}) - q_5\sin(\theta_{m,1})$
  • $q'_5 = q_1\sin(\theta_{m,1}) + q_5\cos(\theta_{m,1})$
  • $q'_2 = q_2\cos(\theta_{m,2}) - q_6\sin(\theta_{m,2})$
  • $q'_6 = q_2\sin(\theta_{m,2}) + q_6\cos(\theta_{m,2})$
  • $q'_3 = q_3\cos(\theta_{m,3}) - q_7\sin(\theta_{m,3})$
  • $q'_7 = q_3\sin(\theta_{m,3}) + q_7\cos(\theta_{m,3})$
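Finally, a self-contained NumPy check of the $d = 8$ example (position $m = 3$ is an arbitrary choice): the vectorized half-and-half formula reproduces the per-pair 2D rotations listed above.

```python
import numpy as np

d, m = 8, 3
rng = np.random.default_rng(0)
q = rng.normal(size=d)

theta = 1.0 / 10000.0 ** (2 * np.arange(d // 2) / d)  # [1, 0.1, 0.01, 0.001]
cos_m = np.cos(np.concatenate([m * theta, m * theta]))
sin_m = np.sin(np.concatenate([m * theta, m * theta]))
q_rotated = np.concatenate([-q[d // 2:], q[:d // 2]])  # [-q4, -q5, -q6, -q7, q0, q1, q2, q3]
q_prime = q * cos_m + q_rotated * sin_m                # half-and-half formula

# Explicit 2D rotation of each pair (q_i, q_{i + d/2}) by m * theta_i.
expected = np.empty(d)
for i in range(d // 2):
    c, s = np.cos(m * theta[i]), np.sin(m * theta[i])
    expected[i] = q[i] * c - q[i + d // 2] * s
    expected[i + d // 2] = q[i] * s + q[i + d // 2] * c
print(np.allclose(q_prime, expected))  # True
```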