diff --git a/.buildkite/nightly-benchmarks/tests/descriptions.md b/.buildkite/nightly-benchmarks/tests/descriptions.md
index b960cc4dd20a4..891e4917070d9 100644
--- a/.buildkite/nightly-benchmarks/tests/descriptions.md
+++ b/.buildkite/nightly-benchmarks/tests/descriptions.md
@@ -19,7 +19,7 @@ This test suite aims to test vllm's throughput.
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
-- Batch size: no constraint, so that vllm can batch as many requests as GPU memory permits.
+- Batch size: dynamically determined by vllm to achieve maximum throughput.
 - Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: throughput.
 
@@ -33,7 +33,7 @@ This test suite aims to test vllm's real serving metrics.
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
-- Batch size: no constraint, so that vllm can batch as many requests as GPU memory permits.
+- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
 - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
 - Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
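
The serving hunk above describes request arrival times drawn from a Poisson process at a fixed average QPS, with a fixed random seed. A minimal sketch of how such arrival timestamps can be generated is shown below; the function name and interface are illustrative assumptions, not the actual code in vllm's benchmark scripts. The key property is that a Poisson process has i.i.d. exponential inter-arrival gaps with mean 1/QPS.

```python
import random


def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Sketch: arrival timestamps (seconds) for a Poisson process at `qps`.

    Illustrative only -- not vllm's benchmark implementation. qps == inf
    models "all requests come at once" (every arrival at t = 0).
    """
    if qps == float("inf"):
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed, matching the reproducibility note above
    t = 0.0
    times = []
    for _ in range(num_requests):
        # Exponential inter-arrival gap with rate `qps` (mean gap = 1 / qps).
        t += rng.expovariate(qps)
        times.append(t)
    return times
```

Because the gaps are exponential, individual arrivals are bursty even though the long-run average rate is QPS, which is what makes the batch size depend on the arrival pattern rather than on a fixed constraint.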