[Bug]: chunked prefill scheduler uses up swap on many n>=2 requests #5578
Comments
@rkooo567 any possible causes?
To make my suggestion clear, the following change on https://github.com/vllm-project/vllm/blob/v0.5.0.post1/vllm/core/scheduler.py#L871-L873 fixes the issue:

```diff
-        # Schedule new prefills.
-        remaining_waiting, prefills = self._schedule_prefills(
-            self.waiting, budget, curr_loras, enable_chunking=True)
+        if len(remaining_swapped) == 0:
+            # Schedule new prefills.
+            remaining_waiting, prefills = self._schedule_prefills(
+                self.waiting, budget, curr_loras, enable_chunking=True)
```

However, the condition …
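The ordering the diff enforces can be sketched in isolation. This is a minimal toy model, not vLLM's actual scheduler: `schedule_step`, its arguments, and the unit budget are all hypothetical simplifications of the real `_schedule_chunked_prefill` logic.

```python
# Toy sketch (hypothetical names) of the proposed scheduling order:
# swapped sequence groups are re-admitted first, and new prefills are
# only started once nothing remains swapped, mirroring _schedule_default.
def schedule_step(waiting, swapped, budget):
    """Build one step's batch, prioritizing swapped groups over prefills."""
    batch = []
    # Swap-in first: these groups already hold KV-cache state on CPU.
    while swapped and budget > 0:
        batch.append(swapped.pop(0))
        budget -= 1
    # The proposed fix: admit new prefills only if no group stays swapped.
    if len(swapped) == 0:
        while waiting and budget > 0:
            batch.append(waiting.pop(0))
            budget -= 1
    return batch
```

With three swapped groups and a budget of two, the step schedules only swap-ins and starts no prefill, so swapped work cannot starve behind new requests.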
I think n>1 creates more sequences, so it is more likely to trigger swapping/preemption (because it puts higher pressure on the KV cache). Checking remaining_swapped == 0 makes sense to me, actually. We should prioritize swapped requests over prefills anyway (and once all swapped requests are scheduled, remaining_swapped becomes 0). @toslunar would you like to create a PR?
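The pressure argument can be made concrete with toy numbers. The function and all figures below (block counts, cache size) are hypothetical, chosen only to illustrate that each request with n>1 forks into n sequences that each need their own KV-cache blocks:

```python
# Toy illustration (hypothetical numbers): a request with sampling
# parameter n forks into n sequences after prefill, each holding its
# own KV-cache blocks, so the GPU cache fills roughly n times faster
# and preemption/swap to CPU kicks in sooner.
def kv_blocks_needed(num_requests, n, blocks_per_seq):
    """Total KV-cache blocks demanded by num_requests with n samples each."""
    return num_requests * n * blocks_per_seq

gpu_blocks = 1000
# With n=1, 50 requests of 10 blocks each fit in the GPU cache.
assert kv_blocks_needed(50, 1, 10) <= gpu_blocks
# With n=4, the same request load overflows it, forcing swap to CPU.
assert kv_blocks_needed(50, 4, 10) > gpu_blocks
```

Once demand exceeds the GPU cache, the scheduler swaps groups out, which is why the prefill-vs-swapped priority matters most under n>=2 workloads.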
Thank you @rkooo567. It makes sense. I created a PR. The diff is slightly different from my previous comment.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Your current environment
🐛 Describe the bug
Sending many `n>=2` (or `best_of>=2`) requests fills up the CPU KV cache, more often if chunked prefill is enabled. `_schedule_chunked_prefill` schedules prefills even if there are swapped seq groups (https://github.com/vllm-project/vllm/blob/v0.5.0.post1/vllm/core/scheduler.py#L871-L873), while `_schedule_default` does not (https://github.com/vllm-project/vllm/blob/v0.5.0.post1/vllm/core/scheduler.py#L763-L766).
To reproduce, send many such requests; this consumes the CPU KV cache (`Running: 39 reqs, Swapped: 129 reqs` in the end).