ANE-friendly static llama #8436
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8436
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 92a1be8 with merge base 52a3a9a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
if not self.generate_full_logits:
    # Only the last logit is used for the new generated token
    h = h[:, input_length - 1, :]
Add .squeeze(1) to make h 2D?
Generally LGTM! The major changes I found are:
- The dedicated KV cache
- The split linear
Anything I missed?
)
self.v_caches[:, :, :, (self.cache_pos) : (self.cache_pos + length), :] = (
    new_v_caches[:, :, :, start : (start + length), :]
)
This still looks like an index put to me? Does it successfully run on ANE?
InputManager and its methods (_update_cache, etc.) are not part of the model. We will have a C++ implementation of it that runs on CPU. This Python implementation is intended to serve as a reference for the C++ one.
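For anyone skimming the thread, here is a rough, self-contained sketch of what such a reference cache update can look like. The tensor layout, class name, and method signature are assumptions for illustration, not the PR's actual InputManager.

```python
import torch

class InputManagerSketch:
    """Illustrative stand-in for the PR's InputManager (layout is assumed)."""

    def __init__(self, n_layers: int, n_heads: int, cache_size: int, head_dim: int):
        self.cache_pos = 0
        # Assumed layout: (1, n_layers, n_heads, cache positions, head_dim)
        self.k_caches = torch.zeros(1, n_layers, n_heads, cache_size, head_dim)
        self.v_caches = torch.zeros(1, n_layers, n_heads, cache_size, head_dim)

    def _update_cache(self, start: int, length: int, new_k_caches, new_v_caches):
        # A plain slice assignment (i.e., an index_put) is fine here: this
        # bookkeeping runs on CPU outside the exported model, not on ANE.
        self.k_caches[:, :, :, self.cache_pos : self.cache_pos + length, :] = (
            new_k_caches[:, :, :, start : start + length, :]
        )
        self.v_caches[:, :, :, self.cache_pos : self.cache_pos + length, :] = (
            new_v_caches[:, :, :, start : start + length, :]
        )
        self.cache_pos += length
```

The point being that only the exported model has to avoid ops that lower poorly to ANE; the cache bookkeeping does not.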
        torch.nn.Linear(in_features, self.common_size)
        for _ in range(self.num_splits)
    ]
)
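The excerpt above is part of the split-linear construction. For context, here is a minimal, self-contained sketch of the general idea; the class name SplitLinear and the max_split parameter are illustrative, not the PR's actual API.

```python
import torch

class SplitLinear(torch.nn.Module):
    """Replace one wide Linear with several narrower ones whose outputs are
    concatenated, keeping each matmul at an ANE-friendly width."""

    def __init__(self, in_features: int, out_features: int, max_split: int = 1024):
        super().__init__()
        self.num_splits = max(1, out_features // max_split)
        assert out_features % self.num_splits == 0, "out_features must split evenly"
        self.common_size = out_features // self.num_splits
        self.splits = torch.nn.ModuleList(
            [
                torch.nn.Linear(in_features, self.common_size)
                for _ in range(self.num_splits)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([linear(x) for linear in self.splits], dim=-1)
```

With max_split=1024, SplitLinear(4096, 4096) computes the same kind of mapping as nn.Linear(4096, 4096) but runs as four 1024-wide matmuls.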
Awesome! So split linear is found to be more performant on ANE? And 1024 is empirically found to be the best split size?
PS: On our end we found that split softmax is more performant as well: apple/coremltools#2418
We haven't done extensive testing on different hardware to find the right splitting values yet. I just recently noticed that 1024 works better on my M1 Pro.
On the SDPA pass (apple/coremltools#2418):
We observed something similar. We can get better Llama performance by processing tokens in smaller seq_length chunks (e.g., 256); this chunks not only the SDPA, but all ops. This is easy enough to do, but it only chunks the Q seq_length (target_seq_length) in SDPA. It doesn't chunk the source_seq_length, which is realistically the larger value because it comes from the K/V caches (e.g., max_context_length). I suspect chunking will help there too, but unlike chunking the target_seq_length, chunking the source_seq_length will require decomposing the SDPA op. Do you have plans to add support for this?
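For concreteness, decomposing SDPA so the source (K/V) dimension can be chunked amounts to an online-softmax accumulation over K/V chunks. A rough, unmasked sketch of that decomposition (just illustrating the math, not the coremltools pass):

```python
import math
import torch

def sdpa_source_chunked(q, k, v, chunk: int = 256):
    # q: (B, H, target_seq_length, D); k, v: (B, H, source_seq_length, D).
    # No attention mask, for brevity.
    scale = 1.0 / math.sqrt(q.shape[-1])
    B, H, Tq, D = q.shape
    m = q.new_full((B, H, Tq, 1), float("-inf"))  # running max of scores
    denom = q.new_zeros((B, H, Tq, 1))            # running softmax denominator
    acc = q.new_zeros((B, H, Tq, D))              # running weighted sum of V
    for s in range(0, k.shape[2], chunk):
        k_c = k[:, :, s : s + chunk, :]
        v_c = v[:, :, s : s + chunk, :]
        scores = torch.matmul(q, k_c.transpose(-1, -2)) * scale
        m_new = torch.maximum(m, scores.amax(dim=-1, keepdim=True))
        correction = torch.exp(m - m_new)         # rescale previously accumulated stats
        p = torch.exp(scores - m_new)
        denom = denom * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + torch.matmul(p, v_c)
        m = m_new
    return acc / denom
```

For unmasked inputs this matches torch.nn.functional.scaled_dot_product_attention(q, k, v) up to floating-point error.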
    target_function_name="model2",
)
desc.default_function_name = "model1"
ct.utils.save_multifunction(desc, f"{output_dir}/combined.mlpackage")
Is Core ML multifunction already runnable via ExecuTorch?
No, I need to remove this script from the PR. I'll do an update.
Those are the main changes. The structure of the KV caches and the attention mask is different from how they're usually constructed. For example, the current tokens are always on the right-most side of the attention mask, whereas usually they are somewhere in the middle. This means we can get the K value for attention with one static concat (k = concat(k_cache, k_curr)). The model also distinguishes between max_seq_length and cache_size, and cache_size can be less than max_seq_length. The effect of this is that older tokens are evicted from the cache and do not participate in attention.
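A minimal sketch of that layout, with shapes assumed for illustration (they may not match the PR exactly):

```python
import torch

# Assumed shapes for illustration only; the PR's tensors may be laid out differently.
B, H, D = 1, 8, 64
cache_size, seq_length = 512, 32

k_cache = torch.zeros(B, H, cache_size, D)  # keys for older (cached) tokens
k_curr = torch.randn(B, H, seq_length, D)   # keys for the tokens being processed now

# Because the current tokens always occupy the right-most slots of the attention
# window, K is produced with one static concat instead of scattering k_curr into
# a max_seq_length buffer with index_put:
k = torch.cat([k_cache, k_curr], dim=2)     # (B, H, cache_size + seq_length, D)

# The attention mask has shape (seq_length, cache_size + seq_length): the last
# seq_length columns are causal over the current tokens, and the first cache_size
# columns expose only the cached positions that are actually populated.
```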
LGTM! Let's build on top of it. Also, there are lots of errors in CI.
I'll take a look at the CI failures.
This directory contains ANE-friendly Llama models.
Export model with:
The runner is written in Python and is only intended to serve as an example of how the model inputs should be processed; it is not performant.
Run model with:
The model here is based on a "sliding" cache, where old tokens are evicted from the cache. By default, the cache size is max_seq_length - seq_length, but you can explicitly pass in a smaller cache size (e.g., --cache_size 512). This can speed up computation and reduce memory. Keep in mind that once cache_size is reached, older tokens get evicted from the cache and do not participate in attention.
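Conceptually, the eviction works like the sketch below (illustrative only, not the runner's actual code): once the cache is full, appending new tokens pushes the oldest ones off the left edge.

```python
import torch

def slide_cache(cache: torch.Tensor, new_entries: torch.Tensor, cache_size: int) -> torch.Tensor:
    # cache: (..., t_cached, head_dim); new_entries: (..., t_new, head_dim)
    combined = torch.cat([cache, new_entries], dim=-2)
    # Keep only the newest cache_size positions; evicted tokens no longer
    # participate in attention.
    return combined[..., -cache_size:, :]
```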
cc @kimishpatel @YifanShenSZ @cymbalrush