This repository has been archived by the owner on Oct 11, 2024. It is now read-only.
mgoin triggered nightly on refs/heads/merge-upstream-0.4.0-to-main #55
nightly.yml
on: workflow_dispatch
| Job | Duration |
|---|---|
| AWS-AVX2-32G-A10G-24G-Benchmark / BENCHMARK | 7h 23m |
| NIGHTLY-MULTI / ... / BUILD | 23m 40s |
| NIGHTLY-SOLO / ... / BUILD | 44m 45s |
| AWS-AVX2-32G-A10G-24G-Accuracy / LM-EVAL | 1h 43m |
| AWS-AVX2-32G-A10G-24G-Benchmark / NM_GH_ACTION_BENCHMARK | 14s |

Matrix: NIGHTLY-MULTI / TEST
Matrix: NIGHTLY-SOLO / TEST
Annotations
4 errors and 10 warnings
AWS-AVX2-32G-A10G-24G-Benchmark / NM_GH_ACTION_BENCHMARK
# :warning: **Performance Alert** :warning:
Possible performance regression was detected for benchmark **'bigger_is_better'**.
The benchmark result of this commit is worse than the previous benchmark result, exceeding the threshold `1.10`.
| Benchmark suite | Current: b3d607a9022ebd492a4c220401cad0b1ae126f8c | Previous: cbe584e8517a8768af656eeb2a929fc769d39157 | Ratio |
|-|-|-|-|
| `{"name": "request_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `16.55643584813369` prompts/s | `24.241633818723457` prompts/s | `1.46` |
| `{"name": "token_throughput", "description": "VLLM Engine prefill throughput - 2:4 Sparse (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 128,\n \"output-len\": 1,\n \"num-prompts\": 1,\n \"sparsity\": \"semi_structured_sparse_w16a16\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `2135.780224409246` tokens/s | `3127.1707626153257` tokens/s | `1.46` |
| `{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `2.4646624872173137` prompts/s | `2.830534125905744` prompts/s | `1.15` |
| `{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 4\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `320.4061233382508` tokens/s | `367.9694363677467` tokens/s | `1.15` |
| `{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `9.994245912863361` prompts/s | `11.00926975419417` prompts/s | `1.10` |
| `{"name": "token_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\": 16\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `1299.251968672237` tokens/s | `1431.2050680452421` tokens/s | `1.10` |
| `{"name": "request_throughput", "description": "VLLM Engine decode throughput - Dense (synthetic)\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax_model_len - 4096\nbenchmark_throughput {\n \"use-all-available-gpus_\": \"\",\n \"input-len\": 2,\n \"output-len\": 128,\n \"num-prompts\":
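The alert above flags rows where the previous-to-current ratio exceeds the `1.10` threshold. For a `bigger_is_better` suite (throughput in prompts/s or tokens/s), the check compares how far the current value has fallen below the previous one. A minimal sketch of that logic (the function name is illustrative, not taken from the action's source):

```python
def is_regression_bigger_is_better(current, previous, threshold=1.10):
    """For a bigger-is-better metric (e.g. prompts/s), the run regresses
    when the previous value exceeds the current one by more than the
    threshold ratio."""
    ratio = previous / current
    return ratio > threshold, ratio

# Values from the first table row above (2:4 sparse request_throughput):
flagged, ratio = is_regression_bigger_is_better(
    16.55643584813369, 24.241633818723457
)
# ratio is ~1.4642, matching the reported `1.46`, so the alert fires.
```

The same computation reproduces the `1.4641819073309834x worse` figure quoted in the warning annotations below the tables.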
AWS-AVX2-32G-A10G-24G-Benchmark / NM_GH_ACTION_BENCHMARK
# :warning: **Performance Alert** :warning:
Possible performance regression was detected for benchmark **'smaller_is_better'**.
The benchmark result of this commit is worse than the previous benchmark result, exceeding the threshold `1.10`.
| Benchmark suite | Current: b3d607a9022ebd492a4c220401cad0b1ae126f8c | Previous: cbe584e8517a8768af656eeb2a929fc769d39157 | Ratio |
|-|-|-|-|
| `{"name": "median_tpot_ms", "description": "VLLM Serving - Dense\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `12.340934612786274` ms | `11.208039014005106` ms | `1.10` |
| `{"name": "median_request_latency", "description": "VLLM Serving - Dense\nmodel - TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"300,1\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `2286.7328040001667` ms | `2028.231687000698` ms | `1.13` |
| `{"name": "mean_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `56238.96208308393` ms | `51017.71923265464` ms | `1.10` |
| `{"name": "median_ttft_ms", "description": "VLLM Serving - Dense\nmodel - NousResearch/Llama-2-7b-chat-hf\nmax-model-len - 4096\nsparsity - None\nbenchmark_serving {\n \"nr-qps-pair_\": \"750,2.5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `68497.120086` ms | `62219.41096000046` ms | `1.10` |
| `{"name": "mean_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `31044.503353522032` ms | `28006.855318505364` ms | `1.11` |
| `{"name": "median_ttft_ms", "description": "VLLM Serving - Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50\nmax-model-len - 4096\nsparsity - sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `30009.372049000376` ms | `26610.821029499675` ms | `1.13` |
| `{"name": "median_request_latency", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "python_version": "3.10.12 (main, Mar 7 2024, 18:39:53) [GCC 9.4.0]", "torch_version": "2.1.2+cu121"}` | `15299.262414000623` ms | `12283.535647499775` ms | `1.25` |
| `{"name": "mean_tpot_ms", "description": "VLLM Serving - 2:4 Sparse\nmodel - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned2.4\nmax-model-len - 4096\nsparsity - semi_structured_sparse_w16a16\nbenchmark_serving {\n \"nr-qps-pair_\": \"1500,5\",\n \"dataset\": \"sharegpt\"\n}", "gpu_description": "NVIDIA A10G x 1", "vllm_version": "0.1.0", "py
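For a `smaller_is_better` suite (latencies in ms such as TTFT and TPOT), the ratio is inverted: the run regresses when the current value exceeds the previous one. A sketch under the same assumptions as above (illustrative function name):

```python
def is_regression_smaller_is_better(current, previous, threshold=1.10):
    """For a smaller-is-better metric (e.g. latency in ms), the run
    regresses when the current value exceeds the previous one by more
    than the threshold ratio."""
    ratio = current / previous
    return ratio > threshold, ratio

# Values from the median_tpot_ms row above:
flagged, ratio = is_regression_smaller_is_better(
    12.340934612786274, 11.208039014005106
)
# ratio is ~1.1011, which rounds to the reported `1.10` and is just
# over the 1.10 threshold, so the alert fires.
```

This matches the `1.1010788414784736x worse` figure in the corresponding warning annotation.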
NIGHTLY-MULTI / TEST (aws-avx2-192G-4-a10g-96G) / TEST
The job running on runner avx2_a10g_4_i-0725d21226c9fcc47 has exceeded the maximum execution time of 480 minutes.

NIGHTLY-SOLO / TEST (aws-avx2-192G-4-a10g-96G) / TEST
The job running on runner avx2_a10g_4_i-0fa8a27e2dbcc2c1e has exceeded the maximum execution time of 480 minutes.
AWS-AVX2-32G-A10G-24G-Benchmark / NM_GH_ACTION_BENCHMARK
- Performance alert! Previous value was 24.241633818723457 and current value is 16.55643584813369. It is 1.4641819073309834x worse than previous, exceeding a ratio threshold 1.1
- Performance alert! Previous value was 3127.1707626153257 and current value is 2135.780224409246. It is 1.4641819073309836x worse than previous, exceeding a ratio threshold 1.1
- Performance alert! Previous value was 2.830534125905744 and current value is 2.4646624872173137. It is 1.1484469539281672x worse than previous, exceeding a ratio threshold 1.1
- Performance alert! Previous value was 367.9694363677467 and current value is 320.4061233382508. It is 1.148446953928167x worse than previous, exceeding a ratio threshold 1.1
- Performance alert! Previous value was 11.00926975419417 and current value is 9.994245912863361. It is 1.1015608231156686x worse than previous, exceeding a ratio threshold 1.1
- Performance alert! Previous value was 1431.2050680452421 and current value is 1299.251968672237. It is 1.1015608231156686x worse than previous, exceeding a ratio threshold 1.1
- Performance alert! Previous value was 3.059486575764275 and current value is 2.7696249808203515. It is 1.1046573442076868x worse than previous, exceeding a ratio threshold 1.1
- Performance alert! Previous value was 397.7332548493558 and current value is 360.05124750664567. It is 1.104657344207687x worse than previous, exceeding a ratio threshold 1.1
- Performance alert! Previous value was 11.208039014005106 and current value is 12.340934612786274. It is 1.1010788414784736x worse than previous, exceeding a ratio threshold 1.1
- Performance alert! Previous value was 2028.231687000698 and current value is 2286.7328040001667. It is 1.1274514734466723x worse than previous, exceeding a ratio threshold 1.1
Artifacts
Produced during runtime
| Name | Size |
|---|---|
| 3.10.12-nm-vllm-0.1.0.tar.gz (Expired) | 404 KB |
| 3.11.4-nm-vllm-0.1.0.tar.gz (Expired) | 404 KB |
| 8516883942-aws-avx2-32G-a10g-24G (Expired) | 124 KB |
| gh_action_benchmark_jsons-8516883942-aws-avx2-32G-a10g-24G (Expired) | 28.9 KB |
| nm_vllm-0.1.0-cp310-cp310-linux_x86_64.whl (Expired) | 87 MB |
| nm_vllm-0.1.0-cp311-cp311-linux_x86_64.whl (Expired) | 87 MB |