Chunked prefill support #392
Comments
Hi, I don't see why we need special support for chunked prefill. The paper that proposes chunked prefill (Sarathi-Serve) uses FlashInfer, so I suppose we already support this feature?
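For context, here is a minimal sketch of the chunked-prefill idea from the Sarathi-Serve paper (illustrative only; the helper name is hypothetical and this is not vLLM or FlashInfer code): a long prompt's prefill is split into fixed-size chunks processed over several scheduler steps, so decode tokens from other requests can be batched alongside each chunk.

```python
# Illustrative sketch of chunked prefill. `make_prefill_chunks` is a
# hypothetical helper, not an API from vLLM or FlashInfer.

def make_prefill_chunks(prompt_token_ids, chunk_size):
    """Yield consecutive slices of the prompt so each scheduler step only
    prefills `chunk_size` tokens, leaving room in the batch for decode tokens."""
    for start in range(0, len(prompt_token_ids), chunk_size):
        yield prompt_token_ids[start:start + chunk_size]


if __name__ == "__main__":
    prompt = list(range(10))  # a 10-token prompt
    for step, chunk in enumerate(make_prefill_chunks(prompt, chunk_size=4)):
        # Each chunk attends to all previously prefilled KV entries plus itself.
        print(f"step {step}: prefill tokens {chunk}")
```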
Thanks for your kind reply. Unfortunately, I've seen badly misaligned results when using chunked prefill + FlashInfer in vLLM. Further investigation is needed.
Thanks for letting me know. There might be some misconfiguration on the vLLM side for chunked prefill, and I'd love to help fix the issue. Can you point me to the chunked prefill implementation in vLLM? cc @LiuXiaoxuanPKU, as Lily might provide some useful insights.
https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/flashinfer.py#L197 and https://github.com/vllm-project/vllm/blob/main/vllm/worker/model_runner.py (search for the chunked prefill handling there).
Hello @Juelianqvq, there is an ongoing effort to unify vLLM's use of FlashAttention, as it currently calls the prefill and decode kernels separately. I suspect a similar situation applies to FlashInfer; I will investigate.
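For reference, a minimal sketch of the "separate kernels" situation described above, with hypothetical names (this is not vLLM's actual backend code): each model step partitions the running batch into sequences still in prefill and sequences in decode, then dispatches each partition to its own attention call; the two `run_*_attention` functions stand in for FlashInfer's batch prefill and batch decode wrappers.

```python
# Hypothetical sketch of splitting a mixed chunked-prefill batch into separate
# prefill and decode attention calls. Names are illustrative, not vLLM APIs.
from dataclasses import dataclass
from typing import List


@dataclass
class Sequence:
    seq_id: int
    num_prompt_tokens: int
    num_computed_tokens: int  # how much of the prompt has been prefilled so far

    @property
    def is_prefill(self) -> bool:
        return self.num_computed_tokens < self.num_prompt_tokens


def run_prefill_attention(seqs: List[Sequence]) -> None:
    # Stand-in for a batched prefill kernel (many query tokens per sequence).
    print("prefill kernel over", [s.seq_id for s in seqs])


def run_decode_attention(seqs: List[Sequence]) -> None:
    # Stand-in for a batched decode kernel (one query token per sequence).
    print("decode kernel over", [s.seq_id for s in seqs])


def run_model_step(batch: List[Sequence]) -> None:
    prefills = [s for s in batch if s.is_prefill]
    decodes = [s for s in batch if not s.is_prefill]
    if prefills:
        run_prefill_attention(prefills)
    if decodes:
        run_decode_attention(decodes)


if __name__ == "__main__":
    # One sequence mid-prefill and one already decoding land in the same batch.
    run_model_step([Sequence(0, 512, 128), Sequence(1, 32, 40)])
```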
What is the status of this? I have a pending PR for supporting chunked prefill with FlashInfer, but I sometimes hit an illegal memory access in FlashInfer's BatchDecodeWithKvCache kernel when running a mixed batch of prefill + decode tokens (the issue is flaky). I'm wondering if anyone has hit the same issue; maybe we can collaborate on debugging.
I think it is supported. We use FlashInfer in SGLang, and we support chunked prefill.
Any plans for this?