Continuous batching load test stuck #5827
Comments
Hi, please share the steps to reproduce your bench. By default, the maximum number of concurrent HTTP requests is set to the number of CPU cores. You can use `--threads-http` to increase it.
@phymbert you can try running the server using the command below, then run k6 with 100 VUs:
If any HTTP call fails, try to hit the server manually; in my case the server didn't respond at all...
I guess you have 8-16 CPU cores, so without specifying `--threads-http`, concurrent requests are capped at that number.
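To make this cap visible from the client side, it can help to ramp the number of VUs past the server's `--parallel` value and watch where latency and failures start. Below is a minimal sketch using k6's standard ramping-vus executor; the 32 and 100 targets simply mirror the values mentioned in this thread, adjust them to your setup:

```javascript
export const options = {
    scenarios: {
        ramp: {
            executor: 'ramping-vus',
            startVUs: 1,
            stages: [
                { duration: '60s', target: 32 },   // up to the server's --parallel value
                { duration: '60s', target: 100 },  // beyond it, to observe queuing and timeouts
            ],
        },
    },
};
```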
I did some tests. Unfortunately, I only have an RTX 3050, so I tested PHI-2 with only 32 slots and 32 users, using:

```shell
server --host localhost --port 8080 --model phi-2.Q4_K_M.gguf --alias phi-2 --cont-batching --metrics --parallel 32 --n-predict 32 -ngl 33 --threads-http 34 -tb 8 --batch-size 96 --ctx-size 4096 --log-format text
```

and this K6 script:

```javascript
import http from 'k6/http'
import { check, sleep } from 'k6'

export default function () {
    const data = {
        "messages": [
            {
                "role": "system",
                "content": "You are a kind AI assistant.",
            },
            {
                "role": "user",
                "content": "I believe the meaning of life is",
            }
        ],
        "model": "model",
        "max_tokens": 32,
        "stream": false,
    }
    let res = http.post('http://localhost:8080/v1/chat/completions', JSON.stringify(data), {
        headers: { 'Content-Type': 'application/json' },
    })
    check(res, { 'success completion': (r) => r.status === 200 })
    sleep(0.3)
}

export const options = {
    vus: 32,           // simulate 32 virtual users
    duration: '300s',  // run the test for 300 seconds
};
```

I have no issue as long as the `--threads-http` parameter is high enough. You can export server metrics during the test:

```
curl http://localhost:8080/metrics
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 73074
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 39707
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 68
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 3
# HELP llamacpp:kv_cache_usage_ratio KV-cache usage. 1 means 100 percent usage.
# TYPE llamacpp:kv_cache_usage_ratio gauge
llamacpp:kv_cache_usage_ratio 0
# HELP llamacpp:kv_cache_tokens KV-cache tokens.
# TYPE llamacpp:kv_cache_tokens gauge
llamacpp:kv_cache_tokens 1907
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 32
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```

If you still face issues, please share all of these steps from your end.
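If it helps, one way to make failing or stuck requests show up clearly in the k6 summary is a thresholds block. This is only a sketch using k6's built-in `http_req_failed` and `http_req_duration` metrics; the limits are placeholders to adjust:

```javascript
export const options = {
    vus: 32,
    duration: '300s',
    thresholds: {
        http_req_failed: ['rate<0.01'],      // fail the run if more than 1% of requests error out
        http_req_duration: ['p(95)<30000'],  // flag runs where p(95) latency exceeds 30 s
    },
};
```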
@phymbert I just tried with `--threads-http` increased. With /v1/chat/completions I still hit issues: do you have any insight?
Nice to hear. Without sharing your k6 script, I cannot help much. Sometimes some requests can be slower for multiple reasons; you can either accept that or increase the timeout. If you need HA, you can scale out the number of servers. Note that, fundamentally, there is no difference between /completion and /v1/chat/completions.
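If slow generations are simply hitting k6's default 60s per-request timeout, the timeout can be raised in the request params. A small sketch, with 120s as an arbitrary example value:

```javascript
let res = http.post('http://localhost:8080/v1/chat/completions', JSON.stringify(data), {
    headers: { 'Content-Type': 'application/json' },
    timeout: '120s',  // k6's default per-request timeout is 60s
});
```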
Here is the k6 script, @phymbert:
Please also try with /completion, @phymbert.
I will test later on, but from my understanding, they are equivalent, just different data structures.
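For reference, here is a sketch of what "equivalent, just different data structures" looks like from k6: /v1/chat/completions takes an OpenAI-style messages array with max_tokens, while /completion takes a raw prompt with n_predict. Field names are as I understand the llama.cpp server API; double-check them against the server README:

```javascript
import http from 'k6/http';

const params = { headers: { 'Content-Type': 'application/json' } };

export default function () {
    // OpenAI-compatible chat endpoint
    const chatBody = {
        messages: [{ role: 'user', content: 'I believe the meaning of life is' }],
        max_tokens: 32,
        stream: false,
    };
    http.post('http://localhost:8080/v1/chat/completions', JSON.stringify(chatBody), params);

    // Native completion endpoint
    const completionBody = {
        prompt: 'I believe the meaning of life is',
        n_predict: 32,
        stream: false,
    };
    http.post('http://localhost:8080/completion', JSON.stringify(completionBody), params);
}
```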
/completion results: [screenshot]

/chat/completions results: [screenshot]
…mparison (#5941)

* server: bench: Init a bench scenario with K6. See #5827
* server: bench: EOL EOF
* server: bench: PR feedback and improved k6 script configuration
* server: bench: remove llamacpp_completions_tokens_seconds as it includes prompt processing time and is misleading; add max_tokens from SERVER_BENCH_MAX_TOKENS; increase truncated rate to 80% before failing
* server: bench: fix doc
* server: bench: change gauge custom metrics to trend; add trend custom metric for total tokens per second average
* server: bench: doc: add an option to debug http requests
* server: bench: filter out dataset sequences that are too short or too long
* server: bench: allow filtering out conversations in the dataset based on an env variable
* server: bench: fix assistant message sent instead of user message
* server: add defrag thold parameter
* server: bench: select prompts based on the current iteration id, not randomly, to make the bench more reproducible

Co-authored-by: Georgi Gerganov <[email protected]>
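This is not the actual bench script from #5941, just a minimal sketch of the configuration pattern the commit message describes: reading SERVER_BENCH_MAX_TOKENS from the environment via k6's `__ENV` (the 512 fallback is an assumption):

```javascript
import http from 'k6/http';

// SERVER_BENCH_MAX_TOKENS is the variable named in the commit message; 512 is an assumed fallback.
const maxTokens = parseInt(__ENV.SERVER_BENCH_MAX_TOKENS || '512', 10);

export default function () {
    const body = {
        messages: [{ role: 'user', content: 'I believe the meaning of life is' }],
        max_tokens: maxTokens,
        stream: false,
    };
    http.post('http://localhost:8080/v1/chat/completions', JSON.stringify(body), {
        headers: { 'Content-Type': 'application/json' },
    });
}
```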
OS: Linux 2d078bb41859 5.15.0-83-generic #92~20.04.1-Ubuntu SMP Mon Aug 21 14:00:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
instance: 1xRTX 3090
load test tool: k6
Hi, I am doing a load test of the llama.cpp server, but somehow the concurrent requests are capped at the --parallel value. Below I give the evidence:
[screenshots of the load test results]
Is the command for batch inference wrong? When the load test completed, I tried to manually send one request, but the model did not respond at all (it seems the slots were not released yet?).
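One way to check whether slots are still marked busy after the test is to poll the /metrics endpoint (available when the server is started with --metrics, as in the command earlier in this thread) and look at the llamacpp:requests_processing gauge. A small k6-style sketch, for illustration only:

```javascript
import http from 'k6/http';
import { check } from 'k6';

export default function () {
    const res = http.get('http://localhost:8080/metrics');
    check(res, {
        'metrics reachable': (r) => r.status === 200,
        'slot gauge present': (r) => r.body.includes('llamacpp:requests_processing'),
    });
}
```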
Any help is appreciated, thank you.