
XLA optimized Implementation of StaticCache with Tensor Indexing API #31129

Closed
wants to merge 14 commits

Conversation


@huzama huzama commented May 30, 2024

Use the index_copy_ method to update the static cache in place and avoid recompilation during each iteration in XLA

What does this PR do?

This PR avoids repeated recompilation at each iteration under XLA by performing in-place updates on the static cache. Specifically, it replaces direct tensor-indexing assignments with index_copy_ method calls, so that cache updates are executed without triggering full graph recompilation, improving runtime performance.

Code Changes

Original Code:

k_out[:, :, cache_position] = key_states
v_out[:, :, cache_position] = value_states

Updated Code:

k_out.index_copy_(2, cache_position, key_states)
v_out.index_copy_(2, cache_position, value_states)
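
For reference, both forms write the same values into the cache; here is a minimal standalone check with plain PyTorch (illustrative only, not part of the PR, using made-up shapes):

import torch

# Hypothetical cache layout: (batch, heads, max_cache_len, head_dim)
k_out_a = torch.zeros(1, 2, 8, 4)
k_out_b = torch.zeros(1, 2, 8, 4)
key_states = torch.randn(1, 2, 3, 4)        # three new positions to write
cache_position = torch.arange(5, 8)         # write into slots 5, 6, 7

k_out_a[:, :, cache_position] = key_states            # direct indexing assignment
k_out_b.index_copy_(2, cache_position, key_states)    # in-place index_copy_

assert torch.equal(k_out_a, k_out_b)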


@zucchini-nlp @gante

Edit:

  • Discussion: #31126

  • The code was further updated to perform out-of-place updates that rebind the cache list entries, instead of in-place updates, to avoid recompilation in XLA.

@huzama huzama marked this pull request as draft May 30, 2024 05:13
@huzama huzama marked this pull request as ready for review May 30, 2024 09:55
@huzama huzama marked this pull request as draft May 31, 2024 04:52
@LysandreJik (Member)

cc @ArthurZucker as well

@ArthurZucker ArthurZucker (Collaborator) left a comment

@huzama (Author) commented Jun 10, 2024

Summary of Changes

@tengomucho, I conducted experiments with both .index_copy_ and Python slicing methods on the GPU and found no significant performance differences. Consequently, I have reverted the StaticCache code to its original form.

In LLaMA, the KV-cache tensor slices are updated in-place; this leads to recompilation events every time a token is generated. To address this issue, we use index tensors and tensor.index_copy() ops to replace the in-place slice updates. Attention masks and output sequences also benefit from the same optimization. [Ref]

Based on @ArthurZucker's suggestions and the above-mentioned guide, I created a new class, StaticCacheXLA, with several key differences from the plain StaticCache class:

  1. Updating the Cache with Out-of-Place index_copy Operations
    XLA lazy tensors perform well with out-of-place operations (a combined sketch of both patterns follows this list):

    k_out = self.key_cache[layer_idx]
    v_out = self.value_cache[layer_idx]
    k_out = k_out.index_copy(2, cache_position, key_states)
    v_out = v_out.index_copy(2, cache_position, value_states)
    self.key_cache[layer_idx] = k_out
    self.value_cache[layer_idx] = v_out

  2. Get seq_len Out-of-Place with index_select

    item = key_cache.index_select(0, torch.tensor(0, device=device))
    head = item.index_select(1, torch.tensor(0, device=device))
    return head.any(dim=-1).sum()
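
Putting the two patterns above together, a minimal self-contained sketch of such a cache (illustrative only; the actual StaticCacheXLA in this PR plugs into the transformers Cache API and derives its shapes from the model config):

import torch

class StaticCacheXLASketch:
    """Illustrative stand-in for StaticCacheXLA: out-of-place updates only."""

    def __init__(self, num_layers, batch_size, num_heads, max_cache_len, head_dim,
                 dtype=torch.bfloat16, device="cpu"):
        shape = (batch_size, num_heads, max_cache_len, head_dim)
        self.key_cache = [torch.zeros(shape, dtype=dtype, device=device) for _ in range(num_layers)]
        self.value_cache = [torch.zeros(shape, dtype=dtype, device=device) for _ in range(num_layers)]

    def update(self, key_states, value_states, layer_idx, cache_position):
        # Out-of-place index_copy produces new (lazy) tensors, which XLA can trace
        # without recompiling; the per-layer list entries are then rebound to them.
        k_out = self.key_cache[layer_idx].index_copy(2, cache_position, key_states)
        v_out = self.value_cache[layer_idx].index_copy(2, cache_position, value_states)
        self.key_cache[layer_idx] = k_out
        self.value_cache[layer_idx] = v_out
        return k_out, v_out

    def get_seq_length(self, layer_idx=0):
        # Same idea as item 2: read via index_select instead of Python slicing,
        # then count the cache slots that contain any non-zero value.
        key_cache = self.key_cache[layer_idx]
        idx0 = torch.tensor([0], device=key_cache.device)
        head = key_cache.index_select(0, idx0).index_select(1, idx0)
        return head.any(dim=-1).sum()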

Performance Improvements

Architecture Design

Using SPMD on Llama 3 (following the guide from PyTorch's high-performance Llama 2 blog) and a simple generate function with greedy decoding, I obtained the following results:

Results

These experiments were run on a TPU v3-8. The input sequence length was 256, and the maximum number of new tokens was 512. I can upload numbers on TPU v4-128 later.

As a result of these changes and optimizations, the new StaticCacheXLA class achieved a generation rate of 6.35 iterations per second at 3.1% TPU utilization, compared with 5.5 iterations per second at 2.7% TPU utilization for the original StaticCache.

  • StaticCacheXLA: 6.35 it/s, TPU utilization: 3.1%
  • StaticCache: 5.5 it/s, TPU utilization: 2.7%
  • Without Cache: 3 s/it, TPU utilization: 79%

Edit:
The significant disparity in TPU utilization arises from the difference between single token generation during cache-enabled runs and the full input processing during non-cache runs. Despite the low TPU utilization, cache-enabled generation is 10 times faster. Additionally, when employing in-place cache updates or Python slicing, the model recompiles the graph in proportion to the number of layers it contains. A similar pattern is observed with XLA on GPUs. I can provide further details, including the relevant code and performance metrics if needed.
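
One way to observe this recompilation behavior (a sketch, not part of the PR; it assumes the torch_xla debug-metrics module) is to dump the XLA metrics report between decoding steps and watch the CompileTime entry:

import torch_xla.debug.metrics as met

# Print XLA metrics between decoding steps. With in-place slicing updates the
# number of CompileTime samples keeps growing per generated token; with the
# out-of-place cache it should stay flat once the graph is warm.
print(met.metrics_report())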

@huzama huzama marked this pull request as ready for review June 10, 2024 12:12
@huzama huzama changed the title Use the Index_copy method to update static cache inplace Out-of-Place updates to StaticCache for XLA Jun 10, 2024
@ArthurZucker ArthurZucker (Collaborator) left a comment


In general this looks OK. We need to make sure generate can use it, so update the mapping to detect whether XLA is used and, if so, use this class. WDYT?

@huzama (Author) commented Jun 13, 2024

@ArthurZucker Thank you for the feedback. If I understand correctly, you are suggesting that we should dynamically map to the appropriate cache implementation based on whether XLA is available. Would the following update reflect your suggestion?

class StaticCacheDefault(Cache):
    """Default static cache implementation."""
    pass

class StaticCacheXLA(Cache):
    """XLA-optimized static cache implementation."""
    pass

# Determine which StaticCache implementation to use based on the availability of XLA.
StaticCache = StaticCacheXLA if is_torch_xla_available() else StaticCacheDefault
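
On the generate side, ArthurZucker's mapping suggestion could then look something like the following sketch (NEED_SETUP_CACHE_CLASSES_MAPPING is assumed to be the cache_implementation registry in the generation utilities at the time of this PR; the two classes are the ones defined in the snippet above):

from transformers.utils import is_torch_xla_available

# Hypothetical wiring: whenever torch_xla is available, cache_implementation="static"
# resolves to the XLA-friendly class instead of the default one.
NEED_SETUP_CACHE_CLASSES_MAPPING = {
    "static": StaticCacheXLA if is_torch_xla_available() else StaticCacheDefault,
}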

@tengomucho tengomucho (Contributor) left a comment


Before merging this, I would suggest one more change: I think you can remove all methods except update and get_seq_length, which are the only two functions that differ from the parent class.

Anyway, this seems great; a quick test on a TPU showed me a 10% inference speedup on gemma-2b!

@huzama huzama requested a review from tengomucho June 14, 2024 01:33
@huzama huzama changed the title Out-of-Place updates to StaticCache for XLA XLA optimized Implementation of StaticCache with Tensor Indexing API Jun 14, 2024
@tengomucho tengomucho (Contributor) left a comment


LGTM!

@huzama huzama requested a review from tengomucho June 14, 2024 12:14
@gante gante (Member) left a comment


The core of the PR LGTM, thank you for working on it! 💛

Open questions before merging:

  1. Utilization in generate: as we are adding more Cache classes, we are noticing that automagically initializing a Cache inside generate is becoming tricky. From my end, all we need is to confirm that we can use this class as
xla_cache = StaticCacheXLA(...)
generate_outputs = model.generate(**inputs, past_key_values=xla_cache)
  2. Tests :D Can we add a test to confirm that the new class doesn't change the model outputs? We can also test that the API described in 1. works in that test (an example of a similar test: test_dynamic_cache_hard).

@huzama (Author) commented Jun 14, 2024

Hello @gante,

A quick test on GPUs (as my TPUs are running jobs) with the following script shows that it can be used this way and that it produces the same results as the original StaticCache.

from transformers import LlamaForCausalLM, AutoTokenizer
from transformers.cache_utils import StaticCacheXLA, StaticCache
import torch

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
).to(1)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

xla_cache = StaticCacheXLA(model.config, 1, 128, dtype=torch.bfloat16, device=1)
cache = StaticCache(model.config, 1, 128, dtype=torch.bfloat16, device=1)

input_ids = tokenizer("This is a test", return_tensors="pt").to(1)

out = model.generate(
    **input_ids, max_new_tokens=32, do_sample=False, past_key_values=cache
)

out_cache = model.generate(
    **input_ids, max_new_tokens=32, do_sample=False, past_key_values=xla_cache
)

torch.all(out == out_cache) 
# True
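
One possible shape for the test @gante asked for, based on the script above (a sketch only; the tiny test checkpoint and the standalone pytest style are assumptions, since the existing cache tests in transformers are unittest-based):

import torch
from transformers import AutoTokenizer, LlamaForCausalLM
from transformers.cache_utils import StaticCache, StaticCacheXLA

def test_static_cache_xla_matches_static_cache():
    # Hypothetical tiny checkpoint to keep the test fast; any Llama-style model works.
    model_id = "hf-internal-testing/tiny-random-LlamaForCausalLM"
    model = LlamaForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("This is a test", return_tensors="pt")

    cache = StaticCache(model.config, 1, 64, device=model.device, dtype=model.dtype)
    xla_cache = StaticCacheXLA(model.config, 1, 64, device=model.device, dtype=model.dtype)

    out = model.generate(**inputs, max_new_tokens=8, do_sample=False, past_key_values=cache)
    out_xla = model.generate(**inputs, max_new_tokens=8, do_sample=False, past_key_values=xla_cache)

    # Both cache implementations must produce identical greedy generations.
    assert torch.equal(out, out_xla)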

@gante (Member) commented Jun 14, 2024

@huzama awesome! Can we add it as a test? 💛 After the test is added, I'm more than happy to approve the PR!

@tengomucho (Contributor)

@gante I even wonder if it would be possible just to patch the original StaticCache class, if they behave the same?

@gante (Member) commented Jun 14, 2024

@tengomucho if all that's needed is to replace slicing by .index_copy_, I'm happy with changing StaticCache itself -- as long as we add a comment like "k_out.index_copy_(2, cache_position, key_states) is equivalent to k_out[:, :, cache_position] = key_states, but with better generalized support", for readability

If that's the case, no new tests would be needed 🤗 @huzama

@huzama (Author) commented Jun 14, 2024

@gante I agree that combining both classes into a single implementation is a better approach. I'll add comments to explain the code and will push the changes with the merged implementation.

@gante gante (Member) left a comment


LGTM, thank you for iterating 🤗

@gante (Member) commented Jun 14, 2024

@huzama you'll need to run make fixup on your transformers root dir and push the changes to make our CI happy 🤗

@gante gante requested a review from ArthurZucker June 14, 2024 14:33
@gante (Member) commented Jun 14, 2024

(unrelated CI failure also observed on fresh PRs from main)

@tengomucho tengomucho (Contributor) left a comment


LGTM!

@ArthurZucker ArthurZucker (Collaborator) left a comment


LGTM, could you make sure you run the slow tests for Llama compile, Gemma compile, etc.?

Comment on lines +875 to +885
key_cache = self.key_cache[layer_idx]
device = key_cache.device

# index_select(dim, index) performs the same operation as item = tensor[..., index, ...]
# but it is used for better generality and flexibility.
# For more information, refer to: https://pytorch.org/cppdocs/notes/tensor_indexing.html

item = key_cache.index_select(0, torch.tensor(0, device=device))
head = item.index_select(1, torch.tensor(0, device=device))

return head.any(dim=-1).sum()
Collaborator

TBH this will be deprecated anyways!


@tengomucho (Contributor)

Hey @huzama, are you planning to clean this up to get it merged? Otherwise let me know; I will be happy to take over to make StaticCache more efficient on XLA!

tengomucho added a commit to huggingface/optimum-tpu that referenced this pull request Jul 8, 2024
This is actually a ripoff of the work originally done as a contribution
to transformers:

huggingface/transformers#31129

The original contribution has not been merged yet, but it shows lower
memory usage and better performance on XLA. So I think it's worth adding
it here.
@huzama (Author) commented Jul 9, 2024

@tengomucho Thank you for taking over. I have been running some other experiments, and I lost track of this pull request. I appreciate your help in making StaticCache more efficient on XLA!

tengomucho added a commit to huggingface/optimum-tpu that referenced this pull request Jul 9, 2024
This is actually a ripoff of the work originally done as a contribution
to transformers:

huggingface/transformers#31129

The original contribution has not been merged yet, but it shows lower
memory usage and better performance on XLA. So I think it's worth adding
it here, to be integrated on optimum-tpu.
@tengomucho (Contributor)

I opened another PR so I could push on the branch. Closing this!

@tengomucho tengomucho closed this Jul 9, 2024