
Missing prometheus metrics in 0.3.0 #2850

Closed
SamComber opened this issue Feb 13, 2024 · 17 comments

@SamComber

First of all, thanks for the great open source library!

The docs promise several additional metrics that I'm not seeing in vLLM 0.3.0. Have these been removed? If I hit /metrics on the OpenAI API server for a deployed model, there is no vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds, or vllm:e2e_request_latency_seconds.
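(For anyone hitting the same question: a quick way to check which vllm: metric families a server actually exposes is to fetch /metrics and filter the exposition text. A minimal stdlib sketch; the sample text below is illustrative, and against a live server you would fetch the body from your own /metrics URL:)

```python
import re

def vllm_metric_names(exposition_text: str) -> set[str]:
    """Return the set of vllm: metric names found in
    Prometheus text-exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        # Skip HELP/TYPE comment lines and non-vllm metrics.
        if line.startswith("#") or not line.startswith("vllm:"):
            continue
        match = re.match(r"(vllm:[A-Za-z0-9_]+)", line)
        if match:
            names.add(match.group(1))
    return names

sample = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="m"} 0.0
process_open_fds 44.0
"""
print(vllm_metric_names(sample))  # {'vllm:num_requests_running'}
```

Against a running server, the text can be fetched with e.g. `urllib.request.urlopen("http://localhost:8000/metrics").read().decode()` (host and port are placeholders).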

[screenshot of /metrics output]
@SamComber
Author

SamComber commented Feb 13, 2024

Just realised the image I'm pulling for the deployment uses vllm/engine/metrics.py from v0.3.0, not the tip of main.

Would it be possible to push another image version to docker hub with the updates?

https://hub.docker.com/r/vllm/vllm-openai/tags

@robertgshaw2-redhat
Collaborator

I think a new release will be pushed soon -> #2859

@grandiose-pizza
Contributor


Hi,

@SamComber
I want to use the metrics but I see something completely different. I have exposed an API using api_server.py.

When I hit http://localhost:8075/metrics/, I get the following instead of the values described in the Metrics class. How do I see those metrics?

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 6290.0
python_gc_objects_collected_total{generation="1"} 8336.0
python_gc_objects_collected_total{generation="2"} 4726.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 826.0
python_gc_collections_total{generation="1"} 75.0
python_gc_collections_total{generation="2"} 6.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.098353664e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.31774976e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71188972784e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 18.27
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 44.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06

@hmellor
Collaborator

hmellor commented Mar 31, 2024

@grandiose-pizza did you start your server with --disable-log-stats? That will prevent the Prometheus metrics from being updated.

@grandiose-pizza
Contributor

grandiose-pizza commented Mar 31, 2024

@hmellor, no, it is set to false at startup:

INFO worker.py:1752 -- Started a local Ray instance.
ens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=True, disable_log_requests=False, max_log_len=None)

Do I need to add anything to this line?

metrics_app = make_asgi_app()

@hmellor
Collaborator

hmellor commented Mar 31, 2024

Also it's worth noting that what you're seeing is different because the original screenshot was taken before we switched from aioprometheus (third party) to prometheus_client (first party).

@grandiose-pizza
Contributor

Could you please share what is expected while using prometheus_client instead?

Is it different from the comment above?
#2850 (comment)

@hmellor
Collaborator

hmellor commented Mar 31, 2024

Changing Prometheus client packages only changes the non-vllm:... metrics, which is what you observed.

The vllm:... metrics should be unchanged.

@grandiose-pizza
Contributor

It is quite strange. I'm trying to figure out how to obtain the stats like here:

class Metrics:

@yabea

yabea commented Apr 1, 2024


I have encountered the same issue as well. If you have resolved it, please let me know. Thank you.

@kalpesh22-21

kalpesh22-21 commented Jun 13, 2024


I am facing the same issue.

@hmellor closed this as not planned (won't fix, can't repro, duplicate, stale) on Aug 2, 2024
@leokster

Is there any update or workaround for this issue?

@pseudotensor

pseudotensor commented Aug 22, 2024

Seeing the same thing: only basic stats in metrics, no usage stats, and Prometheus is not being populated.

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 156170.0
python_gc_objects_collected_total{generation="1"} 180292.0
python_gc_objects_collected_total{generation="2"} 114521.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 2102.0
python_gc_collections_total{generation="1"} 191.0
python_gc_collections_total{generation="2"} 10.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.4693138432e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.168400384e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72430453209e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 59.7
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 23.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06

I think it may be broken in 0.5.4.

On the SAME host system, also running 0.5.4, just a different model, I get more:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 3.74064e+07
python_gc_objects_collected_total{generation="1"} 3.649437e+06
python_gc_objects_collected_total{generation="2"} 157913.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 63451.0
python_gc_collections_total{generation="1"} 5766.0
python_gc_collections_total{generation="2"} 105.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.34454657024e+011
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.4426368e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72178452291e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 19123.94
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 79.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP vllm:cache_config_info information of cache_config
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.95",num_cpu_blocks="1638",num_gpu_blocks="16334",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 7.5344734e+07
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 954848.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.01",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.02",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.04",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.06",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.08",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.25",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.75",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="2.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="7.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 1.7057619094848633
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_bucket{le="0.01",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.025",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.05",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.075",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.1",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.15",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.2",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.3",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.4",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.75",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="2.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 14.214749813079834
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 14153.0
vllm:e2e_request_latency_seconds_bucket{le="2.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16216.0
vllm:e2e_request_latency_seconds_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17117.0
vllm:e2e_request_latency_seconds_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17400.0
vllm:e2e_request_latency_seconds_bucket{le="15.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17473.0
vllm:e2e_request_latency_seconds_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17484.0
vllm:e2e_request_latency_seconds_bucket{le="30.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="40.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="50.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="60.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 15472.243278980255
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
vllm:request_prompt_tokens_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
vllm:request_prompt_tokens_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 2.0
vllm:request_prompt_tokens_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 766.0
vllm:request_prompt_tokens_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 976.0
vllm:request_prompt_tokens_bucket{le="50.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 1829.0
vllm:request_prompt_tokens_bucket{le="100.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 1954.0
vllm:request_prompt_tokens_bucket{le="200.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 3540.0
vllm:request_prompt_tokens_bucket{le="500.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 4995.0
vllm:request_prompt_tokens_bucket{le="1000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 5979.0
vllm:request_prompt_tokens_bucket{le="2000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 8386.0
vllm:request_prompt_tokens_bucket{le="5000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 12431.0
vllm:request_prompt_tokens_bucket{le="10000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 14606.0
vllm:request_prompt_tokens_bucket{le="20000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17136.0
vllm:request_prompt_tokens_bucket{le="50000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17466.0
vllm:request_prompt_tokens_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_prompt_tokens_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_prompt_tokens_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 7.5331484e+07
# HELP vllm:request_generation_tokens Number of generation tokens processed.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 18.0
vllm:request_generation_tokens_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 45.0
vllm:request_generation_tokens_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 66.0
vllm:request_generation_tokens_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 2394.0
vllm:request_generation_tokens_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 5039.0
vllm:request_generation_tokens_bucket{le="50.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 15113.0
vllm:request_generation_tokens_bucket{le="100.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16031.0
vllm:request_generation_tokens_bucket{le="200.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16491.0
vllm:request_generation_tokens_bucket{le="500.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17257.0
vllm:request_generation_tokens_bucket{le="1000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17436.0
vllm:request_generation_tokens_bucket{le="2000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17488.0
vllm:request_generation_tokens_bucket{le="5000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="10000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="20000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="50000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 954675.0
# HELP vllm:request_params_best_of Histogram of the best_of request parameter.
# TYPE vllm:request_params_best_of histogram
vllm:request_params_best_of_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
# HELP vllm:request_params_n Histogram of the n request parameter.
# TYPE vllm:request_params_n histogram
vllm:request_params_n_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 965.0
vllm:request_success_total{finished_reason="stop",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16530.0
# HELP vllm:spec_decode_draft_acceptance_rate Speulative token acceptance rate.
# TYPE vllm:spec_decode_draft_acceptance_rate gauge
# HELP vllm:spec_decode_efficiency Speculative decoding system efficiency.
# TYPE vllm:spec_decode_efficiency gauge
# HELP vllm:spec_decode_num_accepted_tokens_total Number of accepted tokens.
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
# HELP vllm:spec_decode_num_draft_tokens_total Number of draft tokens.
# TYPE vllm:spec_decode_num_draft_tokens_total counter
# HELP vllm:spec_decode_num_emitted_tokens_total Number of emitted tokens.
# TYPE vllm:spec_decode_num_emitted_tokens_total counter
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0

Is it possible that some models do not support those other metrics?
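(Side note on reading the histograms above: Prometheus histogram buckets are cumulative over `le`, so per-bucket counts and latency percentiles come from differences. A small sketch using the vllm:e2e_request_latency_seconds numbers quoted in the dump above:)

```python
# Cumulative bucket counts copied from the vllm:e2e_request_latency_seconds
# histogram above: each value counts requests with latency <= le seconds.
buckets = {1.0: 14153, 2.5: 16216, 5.0: 17117, 10.0: 17400}
total = 17495  # vllm:e2e_request_latency_seconds_count

# Fraction of requests that completed within 1 second.
under_1s = buckets[1.0] / total
print(f"{under_1s:.1%}")  # 80.9%

# Requests taking between 1.0s and 2.5s: buckets are cumulative,
# so the per-bucket count is the difference of adjacent buckets.
between = buckets[2.5] - buckets[1.0]
print(between)  # 2063
```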

@pseudotensor

@hmellor Why was this issue closed as not planned? It seems clearly a bug in a useful feature.

@robertgshaw2-redhat
Collaborator

  • We implemented Prometheus metrics in v0.3.1
  • There is a bug in v0.5.4 re: Prometheus due to the multiprocessing in the OpenAI server. Use --disable-frontend-multiprocessing to work around it
  • This is fixed on main
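(For reference, the workaround above amounts to adding the flag at launch; the model name and port below are placeholders, assuming the module entrypoint used elsewhere in this thread:)

```shell
# Workaround for the v0.5.4 frontend-multiprocessing metrics bug:
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-Nemo-Instruct-2407 \
    --port 8000 \
    --disable-frontend-multiprocessing
# Note: --disable-log-stats must NOT be set, or the Prometheus
# metrics stop being updated (see @hmellor's comment above).
```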

@hmellor
Copy link
Collaborator

hmellor commented Aug 27, 2024

@pseudotensor Annoyingly, "not planned" can mean many things (why we can't specify which thing, I don't know), but this was closed as stale originally.

[screenshot of the close-reason options]

@pseudotensor

No problem, it's all working in main. Thanks!

8 participants