Optimize data movement #20
Conversation
LGTM!
  # Directly call FlashAttention's internal function to avoid allocating
  # a new tensor for the output.
  _flash_attn_forward(
      query,
      key,
      value,
      output,
      cumulative_prompt_lens,
      cumulative_prompt_lens,
      max_prompt_len,
      max_prompt_len,
      dropout_p=0.0,
      softmax_scale=self.scale,
      causal=True,
- )[0]
- # FIXME(woosuk): Unnecessary copy. Optimize this.
- output.copy_(out, non_blocking=True)
+     return_softmax=False,
+ )
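For readers unfamiliar with the pattern: passing a preallocated output buffer into the kernel removes both the temporary allocation and the output.copy_ that the FIXME pointed at. Below is a minimal generic sketch of the same idea using PyTorch's out= convention; it is illustrative only, not the FlashAttention API, and the shapes are made up.

import torch

num_tokens, num_heads, head_size = 16, 8, 64  # hypothetical sizes

# Preallocated destination buffer, analogous to `output` above.
output = torch.empty(num_tokens, num_heads, head_size)

a = torch.randn(num_tokens, num_heads, head_size)
b = torch.randn(num_tokens, num_heads, head_size)

# Old pattern: compute into a fresh tensor, then copy it into the buffer.
tmp = a + b                  # extra allocation
output.copy_(tmp)            # extra copy (what the removed lines did)

# New pattern: write the result directly into the preallocated buffer.
torch.add(a, b, out=output)  # no temporary, no copy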
Just curious, so flash attention natively supports non-contiguous QKV tensors?
Yes. It actually requires a qkv tensor of shape [num_tokens, 3, num_heads, head_size]. Previously, we inserted torch.stack to meet this shape requirement, and this PR eliminates this inefficiency.
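To make the shape requirement concrete, here is a minimal sketch (hypothetical shapes, not the PR's actual code): if the projection already produces a packed qkv tensor of shape [num_tokens, 3, num_heads, head_size], query/key/value can be taken as views of it, so no torch.stack is needed and no data is moved.

import torch

num_tokens, num_heads, head_size = 16, 8, 64  # hypothetical sizes

# Packed tensor in the layout FlashAttention expects:
# [num_tokens, 3, num_heads, head_size].
qkv = torch.randn(num_tokens, 3, num_heads, head_size)

# query, key, value are views into qkv -- no torch.stack, no copy.
query, key, value = qkv.unbind(dim=1)

# The views share storage with qkv and are non-contiguous,
# which is why the kernels must accept strided inputs.
assert query.data_ptr() == qkv.data_ptr()
assert not query.is_contiguous()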
Speed before this PR on 1 A100:
After:
Should be merged after #15.

The changes in this PR eliminate the need for redundant data movements such as torch.cat, torch.stack, and torch.contiguous, which were previously used to align input and output shapes. The PR modifies existing kernels and adds new kernels to accommodate non-contiguous tensors, making these data movement operators unnecessary.
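For context on why these operators matter: torch.cat, torch.stack, and contiguous() each allocate fresh storage and copy their inputs, so on the hot path they are pure data movement. A small generic sketch (not the modified kernels themselves) of what the PR avoids:

import torch

num_tokens, num_heads, head_size = 16, 8, 64  # hypothetical sizes
q = torch.randn(num_tokens, num_heads, head_size)
k = torch.randn(num_tokens, num_heads, head_size)
v = torch.randn(num_tokens, num_heads, head_size)

# Each of these materializes a new tensor and copies its inputs.
stacked = torch.stack([q, k, v], dim=1)      # [num_tokens, 3, heads, dim]
concatenated = torch.cat([q, k, v], dim=-1)  # [num_tokens, heads, 3 * dim]
recopied = stacked[:, 0].contiguous()        # copies an existing view

assert stacked.data_ptr() != q.data_ptr()
assert recopied.data_ptr() != stacked.data_ptr()

# A kernel that accepts strided (non-contiguous) tensors can consume a
# view such as stacked[:, 0] directly, so none of the copies above are
# required -- which is the data movement this PR eliminates.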