
Continuous batching load test stuck #5827

Closed
Kev1ntan opened this issue Mar 2, 2024 · 12 comments · Fixed by #5836
Comments

Kev1ntan commented Mar 2, 2024

OS: Linux 2d078bb41859 5.15.0-83-generic #92~20.04.1-Ubuntu SMP Mon Aug 21 14:00:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

instance: 1xRTX 3090

load test tool: k6

Hi, I am doing a load test of the llama.cpp server, but somehow the concurrent requests are capped at the --parallel n value. Below is the evidence:
[Screenshot: Screen Shot 2024-03-02 at 10 05 07]
[Screenshot: Screen Shot 2024-03-02 at 10 04 36]
[Screenshot: Screen Shot 2024-03-02 at 10 06 46]

Is the command for batch inference wrong? When the load test completed, I tried to manually send one request, but the model did not respond at all (it seems the slots were not released yet?).

Any help is appreciated, thank you.

phymbert commented Mar 2, 2024

Hi, please share the steps to reproduce your bench.

By default, the maximum number of concurrent HTTP requests is set to the number of CPU cores. You can use --threads-http to increase it to the number of slots (--parallel). @ggerganov I got your point; let's initialize it by default to n_slots.
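
For reference, a quick way to see what that default would be on a Linux host is to check the reported core count:

nproc    # prints the number of available processing units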

Kev1ntan commented Mar 3, 2024

@phymbert you can try running the server with the command below:
./server -m ../models/mistral-7b-v0.1.Q8_0.gguf -c 2048 --port 9000 -ngl 33 -tb 64 -cb -np 64

then run k6 with 100 VUs:

export const options = {
  vus: 100, // simulate 100 virtual users
  duration: '60s', // running the test for 60 seconds
};

If any HTTP calls fail, try to hit the server manually; in my case the server did not respond at all...
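
For completeness, a minimal full script around those options might look like the sketch below (the endpoint, prompt, and n_predict value are illustrative, not from the original report):

import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  vus: 100,        // simulate 100 virtual users
  duration: '60s', // run the test for 60 seconds
};

export default function () {
  const headers = { 'Content-Type': 'application/json' };
  // Cap generation so a single request cannot run unbounded.
  http.post('http://localhost:9000/completion', JSON.stringify({
    prompt: 'i am',
    n_predict: 32,
  }), { headers });
  sleep(1); // each VU waits 1 second before its next request
}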

phymbert commented Mar 3, 2024

I guess you have 8-16 CPU cores, so without specifying --threads-http=102 your server will get stuck with 100 users.
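
For example, the earlier launch command with the HTTP thread pool raised accordingly might look like this (a sketch based on the command shared above; adjust paths and values to your setup):

./server -m ../models/mistral-7b-v0.1.Q8_0.gguf -c 2048 --port 9000 -ngl 33 -tb 64 -cb -np 64 --threads-http 102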

phymbert commented Mar 3, 2024

I did some tests; unfortunately, I only have an RTX 3050, so I tested with Phi-2 and only 32 slots and 32 users.

Using:

server --host localhost --port 8080 --model phi-2.Q4_K_M.gguf --alias phi-2 --cont-batching --metrics --parallel 32 --n-predict 32 -ngl 33 --threads-http 34 -tb 8 --batch-size 96 --ctx-size 4096 --log-format text

On:
Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes

K6 Script
import http from 'k6/http'
import {check, sleep} from 'k6'

export default function() {
    const data = {
        "messages": [
            {
                "role": "system",
                "content": "You are a kind AI assistant.",
            },
            {
                "role": "user",
                "content": "I believe the meaning of life is",
            }
        ],
        "model": "model",
        "max_tokens": 32,
        "stream": false,
    }
    let res = http.post('http://localhost:8080/v1/chat/completions',JSON.stringify(data), {
        headers: { 'Content-Type': 'application/json' },
    })

    check(res, {'success completion': (r) => r.status === 200})

    sleep(0.3)
}

export const options = {
    vus: 32, // simulate 32 virtual users
    duration: '300s', // run the test for 300 seconds
};
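
If the script is saved as, say, script.js (filename illustrative), it can be run with:

k6 run script.js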

I have no issue if the parameter --threads-http 34 is set.

You can export server metrics during the test:

curl http://localhost:8080/metrics
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 73074
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 39707
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 68
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 3
# HELP llamacpp:kv_cache_usage_ratio KV-cache usage. 1 means 100 percent usage.
# TYPE llamacpp:kv_cache_usage_ratio gauge
llamacpp:kv_cache_usage_ratio 0
# HELP llamacpp:kv_cache_tokens KV-cache tokens.
# TYPE llamacpp:kv_cache_tokens gauge
llamacpp:kv_cache_tokens 1907
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 32
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
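
One way to watch these values evolve while k6 is running (assuming the same host and port as above) is to poll the endpoint periodically:

watch -n 1 'curl -s http://localhost:8080/metrics'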

If you still face issues, please share all these steps on your end.

Kev1ntan commented Mar 3, 2024

@phymbert I just tried with ./server -m ../models/mistral-7b-v0.1.Q8_0.gguf -c 2048 --host localhost --port 9000 -ngl 33 -tb 64 -cb -np 64 --threads-http 66 and found something weird:

  1. In my earlier test using the /completion endpoint, requests got stuck at the --parallel number.
  2. The current test still hits the same issue with the /completion endpoint.
  3. Using /v1/chat/completions, it no longer gets stuck at the --parallel number, but I still got some request timeouts. Below are both tests I just ran.

with /completion:
[Screenshot: Screen Shot 2024-03-03 at 19 16 27]

with /v1/chat/completions:
[Screenshot: Screen Shot 2024-03-03 at 19 16 59]
This one is better, but there were still 6 request timeouts out of 632 requests.

Do you have any insight?

phymbert commented Mar 3, 2024

Nice to hear. Without your k6 script, I cannot help much. Sometimes requests are slower for various reasons; you can either accept that or increase the timeout. If you need HA, you can scale out the number of servers.

Note that, fundamentally, there is no difference between /completion and /chat/completions.
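
As an illustration of that equivalence, the two calls below should produce comparable generations (payload values are illustrative; the port matches the command shared earlier in this thread):

curl http://localhost:9000/completion -H 'Content-Type: application/json' -d '{"prompt": "I believe the meaning of life is", "n_predict": 32}'

curl http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{"messages": [{"role": "user", "content": "I believe the meaning of life is"}], "max_tokens": 32}'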

Kev1ntan commented Mar 3, 2024

Here is the k6 script, @phymbert:

import http from 'k6/http';
import { sleep } from 'k6';
export const options = {
  vus: 50,
  duration: '60s',
};
export default function () {
    let headers = { 'Content-Type': 'application/json' };
    http.post('http://localhost:9000/v1/chat/completions', JSON.stringify({
        "messages": [
            {
                "role": "user",
                "content": "do you know indonesia?, if yes please describe indonesia in details"
            }
        ]
    }), { headers: headers });
    sleep(1); // virtual user will wait for 1 second before the next request
}

Kev1ntan commented Mar 3, 2024

please try with /completion also @phymbert

phymbert commented Mar 3, 2024

please try with /completion also @phymbert

I will test later on, but from my understanding, they are equivalent, just different data structures.
Do you see any difference in terms of tokens/s or RPM?

Kev1ntan commented Mar 3, 2024

please try with /completion also @phymbert

I will test later on, but from my understanding, they are equivalent, just different data structures. Do you see any difference in terms of tokens/s or RPM?

/completion

import http from 'k6/http';
import { sleep } from 'k6';
export const options = {
  vus: 60, 
  duration: '60s', 
};
export default function () {
    let headers = { 'Content-Type': 'application/json' };
    let response = http.post('http://localhost:9000/completion', JSON.stringify({
        "prompt": "i am"
    }), { headers: headers });
    sleep(1); // virtual user will wait for 1 second before the next request
}

results:

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 216
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 6075
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 27
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 3
# HELP llamacpp:kv_cache_usage_ratio KV-cache usage. 1 means 100 percent usage.
# TYPE llamacpp:kv_cache_usage_ratio gauge
llamacpp:kv_cache_usage_ratio 0
# HELP llamacpp:kv_cache_tokens KV-cache tokens.
# TYPE llamacpp:kv_cache_tokens gauge
llamacpp:kv_cache_tokens 1664
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 49
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
[Screenshot: Screen Shot 2024-03-03 at 21 00 35]

/chat/completions

import http from 'k6/http';
import { sleep } from 'k6';
export const options = {
  vus: 60,
  duration: '60s',
};
export default function () {
    let headers = { 'Content-Type': 'application/json' };
    http.post('http://localhost:9000/v1/chat/completions', JSON.stringify({
        "messages": [
            {
                "role": "user",
                "content": "do you know indonesia?, if yes please describe indonesia in details"
            }
        ],
    }), { headers: headers });
    sleep(1); // virtual user will wait for 1 second before the next request
}

results:

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 18788
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 4697
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 72
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 0
# HELP llamacpp:kv_cache_usage_ratio KV-cache usage. 1 means 100 percent usage.
# TYPE llamacpp:kv_cache_usage_ratio gauge
llamacpp:kv_cache_usage_ratio 0
# HELP llamacpp:kv_cache_tokens KV-cache tokens.
# TYPE llamacpp:kv_cache_tokens gauge
llamacpp:kv_cache_tokens 1216
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
[Screenshot: Screen Shot 2024-03-03 at 21 01 08]

phymbert commented Mar 3, 2024

max_tokens is n_predict in /completion, and the messages content goes in the prompt input. Please see the README.
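
A sketch of what that mapping looks like as a /completion request body (values illustrative):

{
  "prompt": "I believe the meaning of life is",
  "n_predict": 32
}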

@Kev1ntan
Copy link
Author

Kev1ntan commented Mar 3, 2024

max_tokens is n_predict in /completion, and the messages content goes in the prompt input. Please see the README.

[Screenshot: Screen Shot 2024-03-03 at 21 16 03] That one, right? Using prompt also works: [Screenshot: Screen Shot 2024-03-03 at 21 18 32]

phymbert added a commit that referenced this issue Mar 8, 2024
phymbert added a commit that referenced this issue Mar 9, 2024
…mparison (#5941)

* server: bench: Init a bench scenario with K6
See #5827

* server: bench: EOL EOF

* server: bench: PR feedback and improved k6 script configuration

* server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading

server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS

server: bench: increase truncated rate to 80% before failing

* server: bench: fix doc

* server: bench: change gauge custom metrics to trend

* server: bench: change gauge custom metrics to trend
server: bench: add trend custom metrics for total tokens per second average

* server: bench: doc add an option to debug http request

* server: bench: filter dataset too short and too long sequences

* server: bench: allow to filter out conversation in the dataset based on env variable

* server: bench: fix assistant message sent instead of user message

* server: bench: fix assistant message sent instead of user message

* server : add defrag thold parameter

* server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible

---------

Co-authored-by: Georgi Gerganov <[email protected]>
hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this issue Mar 10, 2024
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this issue Mar 12, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this issue Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024