
RoPE inv_freq code #410

Closed
d-kleine opened this issue Oct 23, 2024 · 8 comments · Fixed by #412

d-kleine commented Oct 23, 2024

I might be wrong, but the code for inv_freq for RoPE does not seem to be fully correct:

# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

Shouldn't it be divided by all dimensions (head_dim), not just half of them?

# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim)))
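
For reference, the RoFormer paper (Su et al., 2021) defines the set of rotation frequencies as

$$\Theta = \left\{ \theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \ldots, d/2] \right\}$$

where d is the head dimension, i.e., the exponent runs over the even integers divided by the full d.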

Sources:

d-kleine added the question (Further information is requested) label on Oct 23, 2024
rasbt commented Oct 23, 2024

Thanks for highlighting this. I remember the RoPE implementation being a bit tricky, and it took me a long time to get it right. In any case, I think the current implementation should be relatively solid. With your update, for example, it wouldn't pass the unit tests (comparison against the Hugging Face implementation) anymore:

def test_rope_llama2(notebook):

I may be overlooking something, or maybe their implementation is wrong too.
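
For illustration, here's roughly the kind of element-wise check such a comparison boils down to (a minimal sketch with made-up helper names, not the notebook's actual test code):

import torch

def inv_freq_current(head_dim, theta_base=10_000):
    # the implementation currently in the notebook
    return 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

def inv_freq_proposed(head_dim, theta_base=10_000):
    # the variant suggested above (dividing by head_dim instead of head_dim // 2)
    return 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / head_dim))

head_dim = 64
print(torch.allclose(inv_freq_current(head_dim), inv_freq_proposed(head_dim)))  # prints False

Since the two variants produce different values, the downstream comparison against the reference embeddings would fail.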

d-kleine commented Oct 23, 2024

True, I will check that and let you know. I am pretty sure that it should be divided by head_dim; something is wrong there.

@d-kleine

@rasbt So I just took a look into the issue and here is what I found:

When computing the theta values for Rotary Position Embedding (RoPE), it is typical to use the even integers from 0 up to head_dim. This is done for several reasons related to the mathematical structure and purpose of RoPE:

  1. Dimensional pairing: RoPE operates by rotating pairs of dimensions within the embedding space. Using the even indices ensures that each pair of dimensions (e.g., dimensions 0 and 1, 2 and 3, etc.) shares one rotation angle. This pairing is crucial because RoPE uses sine and cosine functions that require two components (real and imaginary parts) to represent complex numbers [1][2]. (A small sketch of this pairwise rotation follows at the end of this comment.)
  2. Sine and cosine embeddings: The even indices align with the way RoPE integrates sine and cosine functions into the positional encoding. Each even index corresponds to a sine component, while the subsequent odd index corresponds to a cosine component, so every position in the sequence gets a unique and smoothly transitioning embedding [2].
  3. Efficiency and symmetry: Using every second index (i.e., the even indices) allows for efficient computation and maintains symmetry in the embedding space. This symmetry is important for preserving the geometric properties of rotations, which are central to how RoPE encodes positional information [1][4].
  4. Complex number representation: By using pairs of dimensions, RoPE can represent positions as complex numbers, where each pair acts as the real and imaginary parts. This representation facilitates encoding relative positional information through rotations, which is more natural and effective than traditional methods over long sequences [3][4].

Sources:
[1] https://codelabsacademy.com/news/roformer-enhanced-transformer-with-rotary-position-embedding-2024-5-30
[2] https://karthick.ai/blog/2024/Rotatory-Position-Embedding-(RoPE)/
[3] https://ai.plainenglish.io/understanding-llama2-kv-cache-grouped-query-attention-rotary-embedding-and-more-c17e5f49a6d7?gi=5b49ed2bcc1f
[4] https://florianjune.substack.com/p/an-in-depth-exploration-of-rotary-position-embedding-rope-ac351a45c794

The code should therefore be:

# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

The tests will then pass too.

This way, each group contains two components of the embedding, and the corresponding rotation angle is calculated for each group; please see: https://aiexpjourney.substack.com/i/144428516/rope-implementation-in-llama

Please also see this official implementation of Meta: https://github.com/meta-llama/llama/blob/8fac8befd776bc03242fe7bc2236cdb41b6c609c/llama/model.py#L100
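
To make the dimensional pairing above concrete, here is a minimal, self-contained sketch of one common (interleaved-pair) convention; the function name and shapes are illustrative, not the notebook's actual implementation:

import torch

def rope_rotate_pairs(x, theta_base=10_000):
    # x has shape (seq_len, head_dim) with an even head_dim
    seq_len, head_dim = x.shape
    # one inverse frequency per dimension pair, built from the even indices 0, 2, 4, ...
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # rotation angle for every (position, pair) combination
    positions = torch.arange(seq_len).float()
    angles = positions[:, None] * inv_freq[None, :]   # (seq_len, head_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    # split each consecutive pair into its two components and rotate it by its angle
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = torch.randn(6, 8)                 # 6 positions, head_dim = 8
print(rope_rotate_pairs(x).shape)     # torch.Size([6, 8])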

rasbt mentioned this issue on Oct 23, 2024
rasbt commented Oct 23, 2024

Thanks for providing the explanation and code. I am still not clear on how those two differ, though:

# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

I think yours only differs if head_dim is not evenly divisible by 2 (but that can never happen); otherwise they should be the same, right?

Here's a quick example of what I mean:

import torch

theta_base = 10_000

for head_dim in range(1, 12):

    before = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))
    after = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))
    
    s = f"{torch.equal(before, after)} | head dim: {head_dim}, {before}, {after}"
    print(s)

This prints:

True | head dim: 1, tensor([]), tensor([])
True | head dim: 2, tensor([1.]), tensor([1.])
True | head dim: 3, tensor([1.]), tensor([1.])
True | head dim: 4, tensor([1.0000, 0.0100]), tensor([1.0000, 0.0100])
False | head dim: 5, tensor([1.0000, 0.0100]), tensor([1.0000, 0.0251])
True | head dim: 6, tensor([1.0000, 0.0464, 0.0022]), tensor([1.0000, 0.0464, 0.0022])
False | head dim: 7, tensor([1.0000, 0.0464, 0.0022]), tensor([1.0000, 0.0720, 0.0052])
True | head dim: 8, tensor([1.0000, 0.1000, 0.0100, 0.0010]), tensor([1.0000, 0.1000, 0.0100, 0.0010])
False | head dim: 9, tensor([1.0000, 0.1000, 0.0100, 0.0010]), tensor([1.0000, 0.1292, 0.0167, 0.0022])
True | head dim: 10, tensor([1.0000e+00, 1.5849e-01, 2.5119e-02, 3.9811e-03, 6.3096e-04]), tensor([1.0000e+00, 1.5849e-01, 2.5119e-02, 3.9811e-03, 6.3096e-04])
False | head dim: 11, tensor([1.0000e+00, 1.5849e-01, 2.5119e-02, 3.9811e-03, 6.3096e-04]), tensor([1.0000, 0.1874, 0.0351, 0.0066, 0.0012])

or am I missing something?

In any case thanks for opening the discussion!

d-kleine commented Oct 23, 2024

Yeah, I think you are right. The first implementation is slightly better because it provides consistent and expected behavior across both even and odd head_dim values, and it lines up more closely with the formula in Section 3.3 (Properties of RoPE) of the original paper:

inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

if theta_base = 10000:

$$10000^{-2i/d} = \frac{1}{10000^{2i/d}}$$
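
To spell out the correspondence with the code: torch.arange(0, head_dim, 2) produces the even numerators 2i for i = 0, ..., d/2 − 1, so dividing by head_dim gives exactly the paper's exponent,

$$\theta_i = 10000^{-2i/d}, \quad i = 0, 1, \ldots, \tfrac{d}{2} - 1$$

with 0-based i and d = head_dim.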

rasbt commented Oct 23, 2024

It's slightly more expensive, though. But yeah, I am happy to accept that change since the main purpose here is educational. Thanks!

d-kleine commented Oct 23, 2024

Yeah, I agree.

I would be happy to see an article about RoPE on AoAI when it suits you in the future. I have read through the paper and several blog posts on this concept but still could not fully understand it. It's crucial for understanding modern positional encodings, though.

rasbt commented Oct 23, 2024

It's on the list :)
