RoPE `inv_freq` code #410
Thanks for highlighting this. I remember the RoPE implementation being a bit tricky, and it took me a long time to get it right. In any case, I think the current implementation should be relatively solid. With your update, for example, it wouldn't pass the unit tests (comparison against the Hugging Face implementation) anymore. I may be overlooking something, or maybe their implementation is wrong too.
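For context, a test of that kind essentially checks that the computed inverse frequencies match a reference formula. Here is a minimal sketch of such a check (not the repository's actual test code; `head_dim` and `theta_base` are illustrative values, and the reference formula mirrors the one from Meta's Llama code linked further down in this thread):

```python
import torch

def inv_freq_halved(head_dim, theta_base=10_000):
    # Divides indices 0 .. head_dim//2 - 1 by head_dim // 2
    return 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))

def inv_freq_even_indices(head_dim, theta_base=10_000):
    # Uses even indices 0, 2, ... divided by the full head_dim
    return 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

head_dim = 64  # a typical (even) head dimension, chosen only for illustration
assert torch.allclose(inv_freq_halved(head_dim), inv_freq_even_indices(head_dim))
```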
True, I will check that and will let you know. I am pretty sure that it should be divided by the full `head_dim`.
@rasbt So I just took a look into the issue and here is what I found: when computing the theta values for Rotary Position Embedding (RoPE), it is typical to use even integers starting from 0 up to `head_dim` (i.e., 0, 2, 4, ..., `head_dim - 2`) and to divide them by the full `head_dim`. The code therefore should be:

```python
# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))
```

The tests will pass then too. Each even index covers two components of the embedding, and the formula calculates the corresponding rotation angle for each such group; see here: https://aiexpjourney.substack.com/i/144428516/rope-implementation-in-llama. Please also see this official implementation from Meta: https://github.com/meta-llama/llama/blob/8fac8befd776bc03242fe7bc2236cdb41b6c609c/llama/model.py#L100
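To make the "rotation angle for each group" part concrete, here is a small sketch (not the repository's or Meta's exact code; `head_dim`, `context_len`, and `theta_base` are illustrative values) of how `inv_freq` turns into per-position, per-group angles:

```python
import torch

head_dim = 8      # assumed even head dimension, for illustration
context_len = 4   # assumed number of token positions
theta_base = 10_000

# Even indices 0, 2, ..., head_dim - 2, divided by the full head_dim
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

positions = torch.arange(context_len).float()
angles = torch.outer(positions, inv_freq)  # shape: (context_len, head_dim // 2)

# Each 2D group (pair of embedding components) is rotated by cos/sin of its angle
cos, sin = torch.cos(angles), torch.sin(angles)
print(cos.shape, sin.shape)  # torch.Size([4, 4]) torch.Size([4, 4])
```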
Thanks for providing the explanation and code. I am still not clear how those two differ, though:

```python
# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))
```

I think yours is only different if `head_dim` is odd. Here's a quick example of what I mean:

```python
import torch

theta_base = 10_000

for head_dim in range(1, 12):
    before = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))
    after = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))
    s = f"{torch.equal(before, after)} | head dim: {head_dim}, {before}, {after}"
    print(s)
```

Or am I missing something? In any case, thanks for opening the discussion!
Yeah, I think you are right. The first implementation is slightly better because it provides consistent and expected behavior across both even and odd `head_dim` values:

```python
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))
```
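For example, with an odd `head_dim` such as 5 (chosen here purely for illustration), the two variants produce different exponents and therefore different inverse frequencies:

```python
import torch

theta_base = 10_000
head_dim = 5  # odd value, for illustration only

before = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))
after = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

print(before)  # tensor([1.0000, 0.0100])  -> exponents 0/2 and 1/2
print(after)   # tensor([1.0000, 0.0251])  -> exponents 0/5 and 2/5
```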
Although it's slightly more expensive. But yeah, I am happy to accept that change since the main purpose here is educational. Thanks!
Yeah, I agree. I would be happy to see an article about RoPE on AoAI when it suits you in the future. I have read through the paper and several blog posts on this concept, but I still could not fully understand it. Yet it's crucial for understanding today's positional encodings.
It's on the list :)
I might be wrong, but I think the code for `inv_freq` for RoPE seems not to be fully correct: shouldn't it be divided by all dimensions (`head_dim`), not just half of them?

Sources: