
add benchmark for append_paged_kv_cache #583

Merged (4 commits, Nov 5, 2024)

Conversation

abcdabcd987
Member

Add a Python benchmark for `append_paged_kv_cache`. Currently, its prefill performance is poor.

For example, here's the result on H100:

```bash
model: l1b      seqlens: [1, 1, 1, 1, 1, 1, 1, 1]                 single_layer: 0.011ms all_layers:   0.173ms throughput:    3.035GB/s
model: l1b      seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]              single_layer: 2.363ms all_layers:  37.807ms throughput:    8.667GB/s
model: l1b      seqlens: [5000]                                   single_layer: 2.346ms all_layers:  37.529ms throughput:    8.731GB/s
model: l1b      seqlens: [625, 625, 625, 625, 625, 625, 625, 625] single_layer: 0.301ms all_layers:   4.819ms throughput:   68.005GB/s
---
model: l3b      seqlens: [1, 1, 1, 1, 1, 1, 1, 1]                 single_layer: 0.009ms all_layers:   0.253ms throughput:    7.241GB/s
model: l3b      seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]              single_layer: 2.342ms all_layers:  65.579ms throughput:   17.489GB/s
model: l3b      seqlens: [5000]                                   single_layer: 2.331ms all_layers:  65.270ms throughput:   17.571GB/s
model: l3b      seqlens: [625, 625, 625, 625, 625, 625, 625, 625] single_layer: 0.313ms all_layers:   8.752ms throughput:  131.045GB/s
---
model: l8b      seqlens: [1, 1, 1, 1, 1, 1, 1, 1]                 single_layer: 0.008ms all_layers:   0.264ms throughput:    7.955GB/s
model: l8b      seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]              single_layer: 2.342ms all_layers:  74.937ms throughput:   17.491GB/s
model: l8b      seqlens: [5000]                                   single_layer: 2.330ms all_layers:  74.564ms throughput:   17.578GB/s
model: l8b      seqlens: [625, 625, 625, 625, 625, 625, 625, 625] single_layer: 0.312ms all_layers:   9.995ms throughput:  131.142GB/s
---
model: l70b-tp8 seqlens: [1, 1, 1, 1, 1, 1, 1, 1]                 single_layer: 0.008ms all_layers:   0.641ms throughput:    1.023GB/s
model: l70b-tp8 seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]              single_layer: 2.252ms all_layers: 180.172ms throughput:    2.273GB/s
model: l70b-tp8 seqlens: [5000]                                   single_layer: 2.264ms all_layers: 181.145ms throughput:    2.261GB/s
model: l70b-tp8 seqlens: [625, 625, 625, 625, 625, 625, 625, 625] single_layer: 0.295ms all_layers:  23.582ms throughput:   17.369GB/s
```
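The throughput column is presumably the total bytes appended to the KV cache divided by elapsed time. A minimal sketch of that arithmetic, assuming hypothetical l1b-like parameters (16 layers, 8 KV heads, head dim 128, fp16) rather than the benchmark's actual configs:

```python
def append_throughput_gbps(seqlens, num_layers, num_kv_heads,
                           head_dim, dtype_bytes, elapsed_ms):
    """Estimate GB/s for appending `seqlens` tokens across all layers.

    Each appended token writes one K vector and one V vector per layer,
    hence the factor of 2.
    """
    total_tokens = sum(seqlens)
    total_bytes = (total_tokens * num_layers * 2
                   * num_kv_heads * head_dim * dtype_bytes)
    return total_bytes / (elapsed_ms * 1e-3) / 1e9

# With the assumed l1b-like parameters, the 8x1-token case above
# (0.173 ms for all layers) comes out to roughly 3 GB/s.
print(append_throughput_gbps([1] * 8, 16, 8, 128, 2, 0.173))
```

This makes the shape of the problem visible: the decode-like cases move only a few hundred KB, so launch overhead dominates and throughput is tiny, while the large-prefill cases are limited by the kernel's batch-level parallelism.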

@abcdabcd987 abcdabcd987 requested a review from yzh119 November 5, 2024 05:52
@abcdabcd987 abcdabcd987 changed the title add bench_append_paged_kv_cache add benchmark for append_paged_kv_cache Nov 5, 2024
@zhyncs zhyncs merged commit e5cafde into flashinfer-ai:main Nov 5, 2024
yzh119 added a commit that referenced this pull request Nov 6, 2024
The performance of `append_paged_kv_cache` is terrible for small batch sizes, a long-standing known issue; this PR fixes it. It also adds support for non-contiguous append keys/values (which could be sliced from a fused QKV matrix).

We first call a Triton kernel to convert `append_indptr` to `batch_indices` and `positions` (similar to [CSR2COO conversion](https://docs.nvidia.com/cuda/cusparse/#cusparse-t-csr2coo) for sparse matrices). After the conversion, we can use element parallelism instead of batch parallelism.
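The conversion itself is simple to state in plain Python; this is only an illustrative sketch of the indptr-to-COO idea (the actual Triton kernel is parallel, and the real `positions` are additionally offset by each request's existing KV length):

```python
def indptr_to_batch_indices_positions(append_indptr):
    """Expand a CSR-style indptr into per-token COO-style indices.

    append_indptr[i]..append_indptr[i+1] spans the tokens appended for
    request i. Returns, for each token, which request it belongs to
    (batch_indices) and its offset within the appended segment (positions).
    """
    batch_indices, positions = [], []
    for i in range(len(append_indptr) - 1):
        length = append_indptr[i + 1] - append_indptr[i]
        batch_indices.extend([i] * length)
        positions.extend(range(length))
    return batch_indices, positions

# Two requests appending 3 and 1 tokens:
# batch_indices -> [0, 0, 0, 1], positions -> [0, 1, 2, 0]
print(indptr_to_batch_indices_positions([0, 3, 4]))
```

With per-token `batch_indices` and `positions` in hand, each GPU thread can handle one element independently, so a batch with one long sequence no longer serializes on a single thread block.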

It's also worth trying Triton for the second `AppendPagedKVCacheKernel` kernel; I think the performance should be fine. I'll leave that for future work.

Some todo items:
1. add torch.compile support.

After this PR (baseline numbers can be found at #583 ):
```bash
model: l1b      seqlens: [1, 1, 1, 1, 1, 1, 1, 1]                 single_layer: 0.006ms all_layers:   0.094ms throughput:    5.563GB/s
model: l1b      seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]              single_layer: 0.014ms all_layers:   0.216ms throughput: 1514.280GB/s
model: l1b      seqlens: [5000]                                   single_layer: 0.014ms all_layers:   0.216ms throughput: 1517.017GB/s
model: l1b      seqlens: [625, 625, 625, 625, 625, 625, 625, 625] single_layer: 0.014ms all_layers:   0.217ms throughput: 1510.863GB/s
---
model: l3b      seqlens: [1, 1, 1, 1, 1, 1, 1, 1]                 single_layer: 0.006ms all_layers:   0.165ms throughput:   11.123GB/s
model: l3b      seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]              single_layer: 0.021ms all_layers:   0.580ms throughput: 1975.732GB/s
model: l3b      seqlens: [5000]                                   single_layer: 0.021ms all_layers:   0.586ms throughput: 1958.078GB/s
model: l3b      seqlens: [625, 625, 625, 625, 625, 625, 625, 625] single_layer: 0.021ms all_layers:   0.581ms throughput: 1973.174GB/s
---
model: l8b      seqlens: [1, 1, 1, 1, 1, 1, 1, 1]                 single_layer: 0.006ms all_layers:   0.185ms throughput:   11.321GB/s
model: l8b      seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]              single_layer: 0.021ms all_layers:   0.661ms throughput: 1982.815GB/s
model: l8b      seqlens: [5000]                                   single_layer: 0.021ms all_layers:   0.662ms throughput: 1980.227GB/s
model: l8b      seqlens: [625, 625, 625, 625, 625, 625, 625, 625] single_layer: 0.021ms all_layers:   0.667ms throughput: 1964.861GB/s
---
model: l70b-tp8 seqlens: [1, 1, 1, 1, 1, 1, 1, 1]                 single_layer: 0.006ms all_layers:   0.457ms throughput:    1.434GB/s
model: l70b-tp8 seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]              single_layer: 0.009ms all_layers:   0.710ms throughput:  576.866GB/s
model: l70b-tp8 seqlens: [5000]                                   single_layer: 0.009ms all_layers:   0.685ms throughput:  598.366GB/s
model: l70b-tp8 seqlens: [625, 625, 625, 625, 625, 625, 625, 625] single_layer: 0.009ms all_layers:   0.690ms throughput:  593.453GB/s
```

cc @abcdabcd987
2 participants