
Missing prometheus metrics in 0.3.0 #2850

Closed
SamComber opened this issue Feb 13, 2024 · 17 comments

@SamComber

First of all, thanks for the great open source library!

The docs promise several additional metrics that I'm not seeing in vLLM 0.3.0. Have these been removed? If I hit /metrics on the OpenAI API server for a deployed model, there is no vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds, or vllm:e2e_request_latency_seconds.
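(For anyone hitting the same question: a quick way to check which vllm: metric families a server actually exposes is to fetch /metrics and filter the exposition text. A minimal stdlib sketch; the sample text below is illustrative, and against a live server you would fetch the body from your own /metrics URL:)

```python
import re

def vllm_metric_names(exposition_text: str) -> set[str]:
    """Return the set of vllm: metric names found in
    Prometheus text-exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        # Skip HELP/TYPE comment lines and non-vllm metrics.
        if line.startswith("#") or not line.startswith("vllm:"):
            continue
        match = re.match(r"(vllm:[A-Za-z0-9_]+)", line)
        if match:
            names.add(match.group(1))
    return names

sample = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="m"} 0.0
process_open_fds 44.0
"""
print(vllm_metric_names(sample))  # {'vllm:num_requests_running'}
```

Against a running server, the text can be fetched with e.g. `urllib.request.urlopen("http://localhost:8000/metrics").read().decode()` (host and port are placeholders).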

[screenshot of /metrics output]
@SamComber
Author

SamComber commented Feb 13, 2024

Just realised the image I'm pulling for the deployment uses vllm/engine/metrics.py from v0.3.0, not the tip of main.

Would it be possible to push another image version to docker hub with the updates?

https://hub.docker.com/r/vllm/vllm-openai/tags

@robertgshaw2-redhat
Collaborator

I think a new release will be pushed soon -> #2859

@grandiose-pizza
Contributor


Hi,

@SamComber
I want to use the metrics but I see something completely different. I have exposed an API using api_server.py.

When I hit http://localhost:8075/metrics/, I get the following instead of the values described in the Metrics class. How do I see those metrics?

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 6290.0
python_gc_objects_collected_total{generation="1"} 8336.0
python_gc_objects_collected_total{generation="2"} 4726.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 826.0
python_gc_collections_total{generation="1"} 75.0
python_gc_collections_total{generation="2"} 6.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.098353664e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.31774976e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71188972784e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 18.27
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 44.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06

@hmellor
Collaborator

hmellor commented Mar 31, 2024

@grandiose-pizza did you start your server with --disable-log-stats? That will prevent the Prometheus metrics from being updated.

@grandiose-pizza
Contributor

grandiose-pizza commented Mar 31, 2024

@hmellor, no, it is set to false at startup:

INFO worker.py:1752 -- Started a local Ray instance.
ens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=True, disable_log_requests=False, max_log_len=None)

Do I need to add anything to this line?

metrics_app = make_asgi_app()

@hmellor
Collaborator

hmellor commented Mar 31, 2024

Also it's worth noting that what you're seeing is different because the original screenshot was taken before we switched from aioprometheus (third party) to prometheus_client (first party).

@grandiose-pizza
Contributor

Could you please share what is expected while using prometheus_client instead?

Is it different from the comment above?
#2850 (comment)

@hmellor
Collaborator

hmellor commented Mar 31, 2024

Changing Prometheus client packages only changes the non-vllm:... metrics, which is what you observed.

The vllm:... metrics should be unchanged.

@grandiose-pizza
Contributor

It is quite strange. I'm trying to figure out how to obtain the stats like here:

class Metrics:

@yabea

yabea commented Apr 1, 2024


I have encountered the same issue as well. If you have resolved it, please let me know. Thank you.

@kalpesh22-21

kalpesh22-21 commented Jun 13, 2024


I am facing the same issue.

@hmellor closed this as not planned (won't fix, can't repro, duplicate, stale) on Aug 2, 2024
@leokster

Is there any update or workaround for this issue?

@pseudotensor

pseudotensor commented Aug 22, 2024

Seeing the same thing: only basic stats in metrics, no usage stats, and Prometheus is not being populated.

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 156170.0
python_gc_objects_collected_total{generation="1"} 180292.0
python_gc_objects_collected_total{generation="2"} 114521.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 2102.0
python_gc_collections_total{generation="1"} 191.0
python_gc_collections_total{generation="2"} 10.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.4693138432e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.168400384e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72430453209e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 59.7
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 23.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06

I think it may be broken in 0.5.4.

On the SAME host system, also running 0.5.4, just a different model, I get more:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 3.74064e+07
python_gc_objects_collected_total{generation="1"} 3.649437e+06
python_gc_objects_collected_total{generation="2"} 157913.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 63451.0
python_gc_collections_total{generation="1"} 5766.0
python_gc_collections_total{generation="2"} 105.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.34454657024e+011
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.4426368e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72178452291e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 19123.94
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 79.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP vllm:cache_config_info information of cache_config
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.95",num_cpu_blocks="1638",num_gpu_blocks="16334",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 7.5344734e+07
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 954848.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.01",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.02",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.04",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.06",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.08",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.25",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="0.75",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="2.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="7.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17499.0
vllm:time_to_first_token_seconds_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 1.7057619094848633
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_bucket{le="0.01",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.025",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.05",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.075",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.1",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.15",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.2",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.3",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.4",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="0.75",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="2.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 937349.0
vllm:time_per_output_token_seconds_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 14.214749813079834
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 14153.0
vllm:e2e_request_latency_seconds_bucket{le="2.5",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16216.0
vllm:e2e_request_latency_seconds_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17117.0
vllm:e2e_request_latency_seconds_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17400.0
vllm:e2e_request_latency_seconds_bucket{le="15.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17473.0
vllm:e2e_request_latency_seconds_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17484.0
vllm:e2e_request_latency_seconds_bucket{le="30.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="40.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="50.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="60.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:e2e_request_latency_seconds_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 15472.243278980255
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
vllm:request_prompt_tokens_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
vllm:request_prompt_tokens_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 2.0
vllm:request_prompt_tokens_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 766.0
vllm:request_prompt_tokens_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 976.0
vllm:request_prompt_tokens_bucket{le="50.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 1829.0
vllm:request_prompt_tokens_bucket{le="100.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 1954.0
vllm:request_prompt_tokens_bucket{le="200.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 3540.0
vllm:request_prompt_tokens_bucket{le="500.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 4995.0
vllm:request_prompt_tokens_bucket{le="1000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 5979.0
vllm:request_prompt_tokens_bucket{le="2000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 8386.0
vllm:request_prompt_tokens_bucket{le="5000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 12431.0
vllm:request_prompt_tokens_bucket{le="10000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 14606.0
vllm:request_prompt_tokens_bucket{le="20000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17136.0
vllm:request_prompt_tokens_bucket{le="50000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17466.0
vllm:request_prompt_tokens_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_prompt_tokens_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_prompt_tokens_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 7.5331484e+07
# HELP vllm:request_generation_tokens Number of generation tokens processed.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 18.0
vllm:request_generation_tokens_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 45.0
vllm:request_generation_tokens_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 66.0
vllm:request_generation_tokens_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 2394.0
vllm:request_generation_tokens_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 5039.0
vllm:request_generation_tokens_bucket{le="50.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 15113.0
vllm:request_generation_tokens_bucket{le="100.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16031.0
vllm:request_generation_tokens_bucket{le="200.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16491.0
vllm:request_generation_tokens_bucket{le="500.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17257.0
vllm:request_generation_tokens_bucket{le="1000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17436.0
vllm:request_generation_tokens_bucket{le="2000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17488.0
vllm:request_generation_tokens_bucket{le="5000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="10000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="20000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="50000.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_generation_tokens_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 954675.0
# HELP vllm:request_params_best_of Histogram of the best_of request parameter.
# TYPE vllm:request_params_best_of histogram
vllm:request_params_best_of_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_best_of_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
# HELP vllm:request_params_n Histogram of the n request parameter.
# TYPE vllm:request_params_n histogram
vllm:request_params_n_bucket{le="1.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="2.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="5.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="10.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="20.0",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_bucket{le="+Inf",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_count{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
vllm:request_params_n_sum{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 17495.0
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 965.0
vllm:request_success_total{finished_reason="stop",model_name="mistralai/Mistral-Nemo-Instruct-2407"} 16530.0
# HELP vllm:spec_decode_draft_acceptance_rate Speulative token acceptance rate.
# TYPE vllm:spec_decode_draft_acceptance_rate gauge
# HELP vllm:spec_decode_efficiency Speculative decoding system efficiency.
# TYPE vllm:spec_decode_efficiency gauge
# HELP vllm:spec_decode_num_accepted_tokens_total Number of accepted tokens.
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
# HELP vllm:spec_decode_num_draft_tokens_total Number of draft tokens.
# TYPE vllm:spec_decode_num_draft_tokens_total counter
# HELP vllm:spec_decode_num_emitted_tokens_total Number of emitted tokens.
# TYPE vllm:spec_decode_num_emitted_tokens_total counter
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="mistralai/Mistral-Nemo-Instruct-2407"} 0.0

Is it possible that some models do not support those other metrics?
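(Side note on reading the histograms above: Prometheus histogram buckets are cumulative over `le`, so per-bucket counts and latency percentiles come from differences. A small sketch using the vllm:e2e_request_latency_seconds numbers quoted in the dump above:)

```python
# Cumulative bucket counts copied from the vllm:e2e_request_latency_seconds
# histogram above: each value counts requests with latency <= le seconds.
buckets = {1.0: 14153, 2.5: 16216, 5.0: 17117, 10.0: 17400}
total = 17495  # vllm:e2e_request_latency_seconds_count

# Fraction of requests that completed within 1 second.
under_1s = buckets[1.0] / total
print(f"{under_1s:.1%}")  # 80.9%

# Requests taking between 1.0s and 2.5s: buckets are cumulative,
# so the per-bucket count is the difference of adjacent buckets.
between = buckets[2.5] - buckets[1.0]
print(between)  # 2063
```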

@pseudotensor

@hmellor Why was this issue closed as not planned? It seems clearly a bug in a useful feature.

@robertgshaw2-redhat
Collaborator

  • We implemented Prometheus metrics in v0.3.1
  • There is a bug in v0.5.4 re: Prometheus due to the multiprocessing in the OpenAI server. Use --disable-frontend-multiprocessing to work around it
  • This is fixed on main
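(For reference, the workaround above amounts to adding the flag at launch; the model name and port below are placeholders, assuming the module entrypoint used elsewhere in this thread:)

```shell
# Workaround for the v0.5.4 frontend-multiprocessing metrics bug:
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-Nemo-Instruct-2407 \
    --port 8000 \
    --disable-frontend-multiprocessing
# Note: --disable-log-stats must NOT be set, or the Prometheus
# metrics stop being updated (see @hmellor's comment above).
```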

@hmellor
Copy link
Collaborator

hmellor commented Aug 27, 2024

@pseudotensor Annoyingly, "not planned" can mean many things (why we can't specify which thing, I don't know), but this was closed as stale originally.

[screenshot of the close-reason options]

@pseudotensor

No problem, it's all working in main. Thanks!

8 participants