
RoPE inv_freq code #410

Closed
d-kleine opened this issue Oct 23, 2024 · 8 comments · Fixed by #412

d-kleine commented Oct 23, 2024

I might be wrong, but the code for inv_freq for RoPE does not seem to be fully correct:

# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

Shouldn't it be divided by all dimensions (head_dim), not just half of them?

# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim)))
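
For reference, the RoFormer paper (Su et al., 2021) defines the set of rotation frequencies as

$$\Theta = \left\{ \theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \ldots, d/2] \right\}$$

where d is the head dimension, i.e., the exponent runs over the even integers divided by the full d.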

Sources:

d-kleine added the question (Further information is requested) label on Oct 23, 2024
rasbt commented Oct 23, 2024

Thanks for highlighting this. I remember the RoPE implementation being a bit tricky, and it took me a long time to get it right. In any case, I think the current implementation should be relatively solid. With your update, for example, it wouldn't pass the unit tests (comparison against the Hugging Face implementation) anymore:

def test_rope_llama2(notebook):

I may be overlooking something, or maybe their implementation is wrong too.
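
For illustration, here's roughly the kind of element-wise check such a comparison boils down to (a minimal sketch with made-up helper names, not the notebook's actual test code):

import torch

def inv_freq_current(head_dim, theta_base=10_000):
    # the implementation currently in the notebook
    return 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

def inv_freq_proposed(head_dim, theta_base=10_000):
    # the variant suggested above (dividing by head_dim instead of head_dim // 2)
    return 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / head_dim))

head_dim = 64
print(torch.allclose(inv_freq_current(head_dim), inv_freq_proposed(head_dim)))  # prints False

Since the two variants produce different values, the downstream comparison against the reference embeddings would fail.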

d-kleine commented Oct 23, 2024

True, I will check that and let you know. I am pretty sure that it should be divided by head_dim; something is wrong there.

@d-kleine

@rasbt So I just took a look into the issue and here is what I found:

When computing the theta values for Rotary Position Embedding (RoPE), it is typical to use the even integers from 0 up to head_dim. This is done for several reasons related to the mathematical structure and purpose of RoPE:

  1. Dimensional pairing: RoPE operates by rotating pairs of dimensions within the embedding space. Using the even indices ensures that each pair of dimensions (e.g., dimensions 0 and 1, 2 and 3, etc.) shares one rotation angle. This pairing is crucial because RoPE uses sine and cosine functions that require two components (real and imaginary parts) to represent complex numbers [1][2]. (A small sketch of this pairwise rotation follows at the end of this comment.)
  2. Sine and cosine embeddings: The even indices align with the way RoPE integrates sine and cosine functions into the positional encoding. Each even index corresponds to a sine component, while the subsequent odd index corresponds to a cosine component, so every position in the sequence gets a unique and smoothly transitioning embedding [2].
  3. Efficiency and symmetry: Using every second index (i.e., the even indices) allows for efficient computation and maintains symmetry in the embedding space. This symmetry is important for preserving the geometric properties of rotations, which are central to how RoPE encodes positional information [1][4].
  4. Complex number representation: By using pairs of dimensions, RoPE can represent positions as complex numbers, where each pair acts as the real and imaginary parts. This representation facilitates encoding relative positional information through rotations, which is more natural and effective than traditional methods over long sequences [3][4].

Sources:
[1] https://codelabsacademy.com/news/roformer-enhanced-transformer-with-rotary-position-embedding-2024-5-30
[2] https://karthick.ai/blog/2024/Rotatory-Position-Embedding-(RoPE)/
[3] https://ai.plainenglish.io/understanding-llama2-kv-cache-grouped-query-attention-rotary-embedding-and-more-c17e5f49a6d7?gi=5b49ed2bcc1f
[4] https://florianjune.substack.com/p/an-in-depth-exploration-of-rotary-position-embedding-rope-ac351a45c794

The code should therefore be:

# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

The tests will then pass too.

This way, each group contains two components of the embedding, and the corresponding rotation angle is calculated for each group; please see: https://aiexpjourney.substack.com/i/144428516/rope-implementation-in-llama

Please also see this official implementation of Meta: https://github.com/meta-llama/llama/blob/8fac8befd776bc03242fe7bc2236cdb41b6c609c/llama/model.py#L100
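
To make the dimensional pairing above concrete, here is a minimal, self-contained sketch of one common (interleaved-pair) convention; the function name and shapes are illustrative, not the notebook's actual implementation:

import torch

def rope_rotate_pairs(x, theta_base=10_000):
    # x has shape (seq_len, head_dim) with an even head_dim
    seq_len, head_dim = x.shape
    # one inverse frequency per dimension pair, built from the even indices 0, 2, 4, ...
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # rotation angle for every (position, pair) combination
    positions = torch.arange(seq_len).float()
    angles = positions[:, None] * inv_freq[None, :]   # (seq_len, head_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    # split each consecutive pair into its two components and rotate it by its angle
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = torch.randn(6, 8)                 # 6 positions, head_dim = 8
print(rope_rotate_pairs(x).shape)     # torch.Size([6, 8])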

rasbt mentioned this issue on Oct 23, 2024
rasbt commented Oct 23, 2024

Thanks for providing the explanation and code. I am still not clear on how those two differ, though:

# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

I think yours only differs if head_dim is not evenly divisible by 2 (but that can never happen); otherwise they should be the same, right?

Here's a quick example of what I mean:

import torch

theta_base = 10_000

for head_dim in range(1, 12):

    before = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))
    after = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))
    
    s = f"{torch.equal(before, after)} | head dim: {head_dim}, {before}, {after}"
    print(s)

This prints:

True | head dim: 1, tensor([]), tensor([])
True | head dim: 2, tensor([1.]), tensor([1.])
True | head dim: 3, tensor([1.]), tensor([1.])
True | head dim: 4, tensor([1.0000, 0.0100]), tensor([1.0000, 0.0100])
False | head dim: 5, tensor([1.0000, 0.0100]), tensor([1.0000, 0.0251])
True | head dim: 6, tensor([1.0000, 0.0464, 0.0022]), tensor([1.0000, 0.0464, 0.0022])
False | head dim: 7, tensor([1.0000, 0.0464, 0.0022]), tensor([1.0000, 0.0720, 0.0052])
True | head dim: 8, tensor([1.0000, 0.1000, 0.0100, 0.0010]), tensor([1.0000, 0.1000, 0.0100, 0.0010])
False | head dim: 9, tensor([1.0000, 0.1000, 0.0100, 0.0010]), tensor([1.0000, 0.1292, 0.0167, 0.0022])
True | head dim: 10, tensor([1.0000e+00, 1.5849e-01, 2.5119e-02, 3.9811e-03, 6.3096e-04]), tensor([1.0000e+00, 1.5849e-01, 2.5119e-02, 3.9811e-03, 6.3096e-04])
False | head dim: 11, tensor([1.0000e+00, 1.5849e-01, 2.5119e-02, 3.9811e-03, 6.3096e-04]), tensor([1.0000, 0.1874, 0.0351, 0.0066, 0.0012])

or am I missing something?

In any case thanks for opening the discussion!

d-kleine commented Oct 23, 2024

Yeah, I think you are right. The first implementation is slightly better because it provides consistent and expected behavior across both even and odd head_dim values, and it lines up more closely with the formula in Section 3.3 (Properties of RoPE) of the original paper:

inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

if theta_base = 10000:

$$10000^{-2i/d} = \frac{1}{10000^{2i/d}}$$
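
To spell out the correspondence with the code: torch.arange(0, head_dim, 2) produces the even numerators 2i for i = 0, ..., d/2 − 1, so dividing by head_dim gives exactly the paper's exponent,

$$\theta_i = 10000^{-2i/d}, \quad i = 0, 1, \ldots, \tfrac{d}{2} - 1$$

with 0-based i and d = head_dim.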

rasbt commented Oct 23, 2024

It's slightly more expensive, though. But yeah, I am happy to accept that change since the main purpose here is educational. Thanks!

d-kleine commented Oct 23, 2024

Yeah, I agree.

I would be happy to see an article about RoPE on AoAI when it suits you in the future. I have read through the paper and several blog posts on this concept but still could not fully understand it. It's crucial for understanding modern positional encodings, though.

rasbt commented Oct 23, 2024

It's on the list :)
