perf: split kv-cache for prefill/append kernels #310

yzh119 · 2024-06-17T08:54:06Z

Duplicate of #75, but re-based on the main branch.

Note that to support CUDAGraph, we cannot pass kv_chunk_size as a function argument: function arguments are passed by value and cannot change once captured by CUDAGraph. Instead, we pass kv_chunk_size_ptr, a pointer to a global memory address that stores kv_chunk_size. The stored value can be set in the BeginForward functions.
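To illustrate the note above, here is a minimal host-side C++ sketch of the difference between freezing a value at capture time and reading it through a pointer at replay time. This is an analogy, not the code in this PR: `CapturedGraph`, `num_chunks`, `capture_by_value`, and `capture_by_pointer` are hypothetical names, a lambda stands in for a captured CUDA graph, and an ordinary host variable stands in for the global-memory slot behind `kv_chunk_size_ptr`.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for a captured CUDA graph: a callable whose
// arguments were fixed at capture time and that can be replayed later.
using CapturedGraph = std::function<uint32_t()>;

// Hypothetical stand-in for kernel work: the number of KV chunks needed to
// cover kv_len tokens (ceiling division).
uint32_t num_chunks(uint32_t kv_len, uint32_t kv_chunk_size) {
  return (kv_len + kv_chunk_size - 1) / kv_chunk_size;
}

// kv_chunk_size is copied into the closure, as a by-value kernel argument
// would be copied into a captured graph. Later changes are not seen.
CapturedGraph capture_by_value(uint32_t kv_len, uint32_t kv_chunk_size) {
  return [=]() { return num_chunks(kv_len, kv_chunk_size); };
}

// Only the address is copied into the closure. The value is read on every
// replay, so updating the pointed-to storage changes the result.
CapturedGraph capture_by_pointer(uint32_t kv_len,
                                 const uint32_t* kv_chunk_size_ptr) {
  return [=]() { return num_chunks(kv_len, *kv_chunk_size_ptr); };
}
```

In this sketch, replaying the by-value closure keeps using the chunk size seen at capture time, while the by-pointer closure picks up whatever was last written to the pointed-to storage. That is the role the description assigns to the BeginForward functions, which set the value before the captured work runs.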

@ibsidorenko

🤖 I have created a release *beep* *boop*

---

## [0.1.0](v0.0.4...v0.1.0) (2024-06-20)

### Highlights

* Support any GQA group size support for tensor-cores kernels.
* Support any page size support for tensor-cores kernels.
* Support CUDA-Graph for prefill/decode APIs.
* Add an option to accelerate decode kernels with Tensor Cores.
* Support custom attention mask. (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
* Support logits cap in Grok-1 models.
* Fused GPU-sampling kernels: top-p, top-k, speculative verification. (https://docs.flashinfer.ai/api/python/sampling.html)
* PyTorch wrapper of group-gemm cutlass kernels. (https://docs.flashinfer.ai/api/python/sampling.html)

### Acknowledgement

We thank [@ibsidorenko](https://github.com/ibsidorenko), [@LiuXiaoxuanPKU](https://github.com/LiuXiaoxuanPKU), [@Yard1](https://github.com/Yard1) [@AgrawalAmey](https://github.com/AgrawalAmey), [@xuzhenqi](https://github.com/xuzhenqi), [@mgerstgrasser](https://github.com/mgerstgrasser), [@esmeetu](https://github.com/esmeetu), [@yz-tang](https://github.com/yz-tang), [@HSQ79815](https://github.com/HSQ79815), [@Qubitium](https://github.com/Qubitium), [@shreygupta2809](https://github.com/shreygupta2809), [@sighingnow](https://github.com/sighingnow), [@vinx13](https://github.com/vinx13), [@tqchen](https://github.com/tqchen), [@merrymercy](https://github.com/merrymercy), [@comaniac](https://github.com/comaniac) and many others for their contributions and helpful discussions for 0.0.5 release.

### Refactor

* support any GQA group size for tensor-cores kernels ([#301](#301)) ([c111ca](c111ca6))
* support any page size for tensor-cores kernels ([#306](#306)) ([82fd8c](82fd8c7))

### Features

* add `use_tensor_cores` option to decode kernels to accelerate GQA ([#317](#317)) ([3b50dd5](3b50dd5))
* add group gemm operators ([#282](#282)) ([e08ba42](e08ba42))
* initial support of distributed operators ([#289](#289)) ([03553da](03553da))
* initial support of logits hook ([#298](#298)) ([ab1e2ad](ab1e2ad))
* Separate Q and KV dtypes for decode ([#286](#286)) ([5602659](5602659))
* support cuda graph for batched multi-query(prefill/append) attention ([#275](#275)) ([83ceb67](83ceb67))
* support cuda graph for batched multi-query(prefill/append) attention ([#277](#277)) ([24cc583](24cc583))
* support custom attention mask in prefill/append attention kernels ([#266](#266)) ([7304282](7304282))
* fused speculative sampilng kernels ([#259](#259)) ([cea2bb](cea2bb9))
* expose sampling APIs in pytorch ([#238](#238)) ([092902](0929023))

### Performance Improvements

* initial cuda graph support ([#256](#256)) ([7e9cc7f](7e9cc7f))
* split kv-cache for prefill/append kernels ([#310](#310)) ([f0bb0a3](f0bb0a3))
* use packed bit array for attention mask ([#308](#308)) ([3d43dc9](3d43dc9))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <[email protected]>

yzh119 mentioned this pull request Jun 18, 2024

[WIP][Feature] Support KV Partition for BatchPrefill kernel for Paged & Ragged KV-Cache. #75

Closed

yzh119 added 3 commits June 18, 2024 09:47

upd

4e5f3fd

wip

b96efc9

wip

9e2f0eb

yzh119 force-pushed the split-qo-kv branch from 073c7a0 to 9e2f0eb Compare June 18, 2024 09:47

yzh119 added 18 commits June 18, 2024 10:01

wip

7699312

upd

66b783f

upd

cde05e8

upd

b4dbe17

bugfix

28816b6

i'm tired, but I don't have time to sleep

6eb89b8

bugfix

2896d2b

fix tests and bench

08d6521

typo

4c8131b

fix binary search typo

6619f65

bugfix

6c96419

fix tvm wrapper

2e2efd7

bugfix

4cc929c

bugfix

127f9c1

bugfix

3216998

seems work well

8f651cd

formatter

c34aa69

initial attemp in supporting cudagraph

4adcf23

yzh119 marked this pull request as ready for review June 19, 2024 21:18

yzh119 added 6 commits June 19, 2024 21:57

bugfix

04b7c75

bugfix

0c8ec79

fix pytorch interface

9518b17

upd

39fcb8b

improve tests

09ab5ac

another bunch of bugfix

e64e8b0

yzh119 added 9 commits June 20, 2024 00:39

improved test for batch ragged prefill

0368154

fix silly bug

925770b

cudagraph compatibility

5fcd952

fix handler

6db4d70

fix prefill.py

43737df

fix bench batch decode

91b7412

upd

1980243

fix mask

822d4e4

bugfix batch_prefill.cu

79cc17a

yzh119 merged commit f0bb0a3 into main Jun 20, 2024

github-actions bot mentioned this pull request Jun 20, 2024

chore(main): release 0.0.5 #232

Merged

yzh119 mentioned this pull request Jun 20, 2024

bugfix: fix cascade test #315

Merged

yzh119 added a commit that referenced this pull request Jun 20, 2024

bugfix: fix cascade test (#315)

2ef20c1

The cascade inference test had been failing for a while; this PR fixes it. It also fixes some formatting issues from the previous PR #310.

yzh119 deleted the split-qo-kv branch June 20, 2024 17:15

github-actions bot mentioned this pull request Jul 31, 2024

chore(main): release 0.1.4 #415

Merged

github-actions bot mentioned this pull request Dec 25, 2024

chore(main): release 0.3.0 #698

Closed
