ANE-friendly static llama #8436

Merged: 12 commits merged into main on Feb 22, 2025
Conversation

@metascroy (Contributor) commented on Feb 13, 2025:

This directory contains ANE-friendly Llama models.

Export model with:

python export.py -n /path/to/output/model.pte -p /path/to/params.json -c /path/to/model.pth --seq_length 64 --max_seq_length 1024 --coreml-quantize c4w

The runner is written in Python and is intended only as an example of how the model inputs should be processed; it is not performant.

Run model with:

python run.py -m /path/to/model.pte -p /path/to/params.json -t /path/to/tokenizer.model --seq_length 64 --max_seq_length 1024 --prompt "Once upon a time," --n_steps 512

The model here is based on a "sliding" cache, where old tokens are evicted from the cache. By default, the cache size is max_seq_length - seq_length, but you can explicitly pass in a smaller cache size (e.g., --cache_size 512). This can speed up computation and reduce memory usage. Keep in mind that once cache_size is reached, older tokens are evicted from the cache and no longer participate in attention.
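For illustration only, here is a minimal sketch of what such a sliding cache update could look like (the function name, shapes, and layout are assumptions, not the actual runner code): once the cache is full, the oldest tokens are shifted out before new ones are appended.

```python
# Hypothetical sketch of a sliding KV cache update; not the actual runner code.
import torch

def update_sliding_cache(cache: torch.Tensor, new_entries: torch.Tensor, used: int) -> int:
    """cache: (n_heads, cache_size, head_dim); new_entries: (n_heads, n_new, head_dim).
    Returns the number of valid cache positions after the update."""
    cache_size = cache.shape[1]
    n_new = new_entries.shape[1]
    overflow = max(0, used + n_new - cache_size)
    if overflow > 0:
        # Evict the oldest `overflow` tokens by shifting the cache left.
        cache[:, : used - overflow, :] = cache[:, overflow:used, :].clone()
        used -= overflow
    cache[:, used : used + n_new, :] = new_entries
    return used + n_new
```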

cc @kimishpatel @YifanShenSZ @cymbalrush

pytorch-bot (bot) commented on Feb 13, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8436

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 92a1be8 with merge base 52a3a9a:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Feb 13, 2025.
@metascroy changed the title from "init" to "Static llama for CoreML" on Feb 13, 2025.

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@metascroy changed the title from "Static llama for CoreML" to "This PR implements an ANE-friendly static llama" on Feb 19, 2025.
@metascroy changed the title from "This PR implements an ANE-friendly static llama" to "ANE-friendly static llama" on Feb 19, 2025.
@metascroy marked this pull request as ready for review on February 19, 2025 at 20:46.

if not self.generate_full_logits:
    # Only the last logit is used for the new generated token
    h = h[:, input_length - 1, :]
A contributor commented on this code:
Add .squeeze(1) to make h of 2d shape?

@YifanShenSZ (Collaborator) left a comment:
Generally LGTM! The major changes I found are:

  1. The dedicated KV cache
  2. The split linear

Anything I missed?

)
self.v_caches[:, :, :, (self.cache_pos) : (self.cache_pos + length), :] = (
new_v_caches[:, :, :, start : (start + length), :]
)
A collaborator commented on this code:
This still looks like an index put to me? Does it successfully run on ANE?

@metascroy (author) replied:
InputManager and its methods (_update_cache, etc.) are not part of the model. We will have a C++ implementation of it that runs on CPU; this Python implementation is intended to serve as a reference for the C++ one.
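As an illustration of that split (hypothetical class and method names, not the PR's actual InputManager API): the write-back is plain eager tensor slicing on the host, so it never has to appear as an index_put inside the exported Core ML graph.

```python
# Hypothetical host-side cache write-back; illustrative only, not the PR's code.
import torch

class HostKVCache:
    def __init__(self, n_layers: int, n_heads: int, cache_size: int, head_dim: int):
        # Layout mirrors the 5-D slicing above: (layers, batch, heads, seq, head_dim).
        self.k = torch.zeros(n_layers, 1, n_heads, cache_size, head_dim)
        self.v = torch.zeros(n_layers, 1, n_heads, cache_size, head_dim)
        self.pos = 0

    def write_back(self, new_k: torch.Tensor, new_v: torch.Tensor, start: int, length: int) -> None:
        # Plain index assignment in eager mode on CPU; the exported model never sees this op.
        self.k[:, :, :, self.pos : self.pos + length, :] = new_k[:, :, :, start : start + length, :]
        self.v[:, :, :, self.pos : self.pos + length, :] = new_v[:, :, :, start : start + length, :]
        self.pos += length
```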

torch.nn.Linear(in_features, self.common_size)
for _ in range(self.num_splits)
]
)
A collaborator commented on this code:
Awesome! So split linear is found to be more performant on ANE? Empirically 1024 is found to be the best?

PS: On our end we found split softmax to be more performant; see apple/coremltools#2418.

@metascroy (author) replied:
We haven't done extensive testing on different hardware to find the right splitting values yet; I recently noticed that 1024 works better on my M1 Pro.

On the SDPA pass (apple/coremltools#2418):

We observed something similar. We can get better Llama performance by processing tokens in smaller seq_length chunks (e.g., 256); this chunks not only the SDPA but all ops. This is easy enough to do, but it only chunks the Q seq_length (target_seq_length) in SDPA. It doesn't chunk the source_seq_length, which is realistically the larger value because it comes from the K/V caches (e.g., max_context_length). I suspect chunking will help there too, but unlike chunking the target_seq_length, chunking the source_seq_length will require decomposing the SDPA op. Do you have plans to add support for this?
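To make the split-linear discussion above concrete, here is a minimal sketch of the idea under assumed names (SplitLinear and split_size are illustrative, not the PR's actual module): one wide projection is replaced by several narrower Linear layers whose outputs are concatenated, which the author found to work better on an M1 Pro with splits of 1024.

```python
# Hypothetical sketch of the split-linear idea; names are illustrative, not the PR's module.
import torch

class SplitLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int, split_size: int = 1024):
        super().__init__()
        assert out_features % split_size == 0, "sketch assumes an even split"
        self.splits = torch.nn.ModuleList(
            [torch.nn.Linear(in_features, split_size) for _ in range(out_features // split_size)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Numerically equivalent to one large Linear whose weight is the
        # concatenation of the split weights along the output dimension.
        return torch.cat([linear(x) for linear in self.splits], dim=-1)
```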

target_function_name="model2",
)
desc.default_function_name = "model1"
ct.utils.save_multifunction(desc, f"{output_dir}/combined.mlpackage")
A collaborator commented on this code:
Is Core ML multifunction already runnable via ExecuTorch?

@metascroy (author) replied:
No, I need to remove this script from the PR. I'll do an update.

@mergennachin self-requested a review on February 20, 2025 at 15:03.
@metascroy (author) commented:
> Generally LGTM! The major changes I found are:
>
> 1. The dedicated KV cache
> 2. The split linear
>
> Anything I missed?

Those are the main changes. The structure of the KV caches and attention mask is different from how they're usually constructed. For example, the current tokens are always on the right-most side of the attention mask, whereas usually they are somewhere in the middle. This means we can get the K values with one static concat (k = concat(k_cache, k_curr)).

The model also distinguishes between max_seq_length and cache_size; cache_size can be less than max_seq_length. The effect is that older tokens are evicted from the cache and no longer participate in attention.
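As a rough illustration of the layout described above (all shapes and names here are assumptions, not the PR's code): because the current tokens always occupy the right-most positions, the per-step K can be formed with a single fixed-shape concat, and the attention mask is a fixed-shape tensor whose right block is causal.

```python
# Illustrative only; shapes and names are assumptions, not the PR's actual code.
import torch

cache_size, seq_length, n_heads, head_dim = 960, 64, 8, 64

k_cache = torch.zeros(1, n_heads, cache_size, head_dim)  # cached tokens (oldest may have been evicted)
k_curr = torch.randn(1, n_heads, seq_length, head_dim)   # current chunk of tokens

# Static concat: the result always has shape (1, n_heads, cache_size + seq_length, head_dim),
# so no dynamic index_put is needed inside the exported graph.
k = torch.cat([k_cache, k_curr], dim=2)

# Attention mask with the current tokens on the right-most columns: each current token
# attends to the cache plus the causally-allowed part of the current chunk.
# (In practice, unused cache slots would additionally be masked.)
mask = torch.zeros(seq_length, cache_size + seq_length)
mask[:, cache_size:] = torch.triu(
    torch.full((seq_length, seq_length), float("-inf")), diagonal=1
)
```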

@cccclai (Contributor) left a comment:
LGTM! Let's build on top of it. Also there are lots of errors in CI

@metascroy (author) commented:
> LGTM! Let's build on top of it. Also there are lots of errors in CI

I'll take a look at the CI failures.

@metascroy added the module: coreml label on Feb 22, 2025.
@metascroy merged commit 54dccc9 into main on Feb 22, 2025 (47 of 49 checks passed).
@metascroy deleted the apple-llama branch on February 22, 2025 at 00:01.