Fix success/error rate metrics #357

linkous8 · 2021-11-02T18:28:37Z

Add "sum by(kubernetes_pod_name)" to success and error rate metrics so
they are averaged by the number of pods instead of averaged by the
number of pods times the number of "envoy_response_code_class"es

linear · 2021-11-02T18:28:39Z

ENG-549 Opsani Dev/Prometheus; success_rate and error_rate queries have incorrect averaging logic

Currently, the Opsani Dev connector configures the success_rate and error_rate queries as follows:

avg(rate(envoy_cluster_upstream_rq_xx{opsani_role!="tuning", envoy_response_code_class=~"2|3"}[1m]))
avg(rate(envoy_cluster_upstream_rq_xx{opsani_role!="tuning", envoy_response_code_class=~"4|5"}[1m]))

The expected behavior is that avg returns the success/error rate averaged over the number of pods. Running the query without avg shows why that is not the case:

The above screenshot shows that the returned rate values are bucketed by kubernetes_pod_name AND by envoy_response_code_class. Because each query matches two response code classes, each pod can contribute two series, which doubles the divisor used by avg. The resulting success/error rates are therefore too low when compared to total_request_rate.

The proposed solution is to wrap the rate query with sum by (kubernetes_pod_name)(…) so that the avg divisor is the number of pods, as expected.
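The effect of the extra grouping label can be reproduced outside Prometheus. The following is a minimal sketch, not code from the connector: the pod names and rate values are invented, and the two helper functions only imitate what `avg(rate(...))` and `avg(sum by(kubernetes_pod_name)(rate(...)))` do to a set of series.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-series rates, keyed by
# (kubernetes_pod_name, envoy_response_code_class).
# Pod names and values are invented for illustration.
rates = {
    ("pod-a", "2"): 90.0,
    ("pod-a", "3"): 10.0,
    ("pod-b", "2"): 80.0,
    ("pod-b", "3"): 20.0,
}


def naive_avg(series):
    """Imitate avg(rate(...)): average over every (pod, class) series."""
    return mean(series.values())


def per_pod_avg(series):
    """Imitate avg(sum by(kubernetes_pod_name)(rate(...))).

    Sum the response code classes for each pod first, then average
    over the pods.
    """
    per_pod = defaultdict(float)
    for (pod, _code_class), value in series.items():
        per_pod[pod] += value
    return mean(per_pod.values())


print(naive_avg(rates))    # 50.0: divisor is 4 series (2 pods x 2 classes)
print(per_pod_avg(rates))  # 100.0: divisor is 2 pods
```

With two pods that each serve 100 successful requests per second, the naive average reports 50 while the per-pod average reports 100, which is the discrepancy against total_request_rate that the issue describes.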

Relevant metric configuration:

servox/servo/connectors/opsani_dev.py

Lines 172 to 194 in 44b1dc2

    
    servo.connectors.prometheus.PrometheusMetric(
        "main_success_rate",
        servo.types.Unit.requests_per_second,
        query='avg(rate(envoy_cluster_upstream_rq_xx{opsani_role!="tuning", envoy_response_code_class=~"2|3"}[1m]))',
    ),
    servo.connectors.prometheus.PrometheusMetric(
        "tuning_success_rate",
        servo.types.Unit.requests_per_second,
        query='avg(rate(envoy_cluster_upstream_rq_xx{opsani_role="tuning", envoy_response_code_class=~"2|3"}[1m]))',
        absent=servo.connectors.prometheus.AbsentMetricPolicy.zero
    ),
    servo.connectors.prometheus.PrometheusMetric(
        "main_error_rate",
        servo.types.Unit.requests_per_second,
        query='avg(rate(envoy_cluster_upstream_rq_xx{opsani_role!="tuning", envoy_response_code_class=~"4|5"}[1m]))',
        absent=servo.connectors.prometheus.AbsentMetricPolicy.zero
    ),
    servo.connectors.prometheus.PrometheusMetric(
        "tuning_error_rate",
        servo.types.Unit.requests_per_second,
        query='avg(rate(envoy_cluster_upstream_rq_xx{opsani_role="tuning", envoy_response_code_class=~"4|5"}[1m]))',
        absent=servo.connectors.prometheus.AbsentMetricPolicy.zero
    ),
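For comparison with the configuration quoted above, here is a sketch of what the four query strings might look like once the rate expression is wrapped in `sum by(kubernetes_pod_name)(…)`. The helper `rate_query` is hypothetical and exists only for this illustration; the merged change may format the queries differently.

```python
def rate_query(role_matcher: str, code_classes: str) -> str:
    """Build an avg-per-pod PromQL query string (illustrative helper).

    role_matcher: label matcher suffix for opsani_role, e.g. '!="tuning"'.
    code_classes: regex alternatives for envoy_response_code_class, e.g. "2|3".
    """
    selector = (
        f'{{opsani_role{role_matcher}, '
        f'envoy_response_code_class=~"{code_classes}"}}'
    )
    # Sum the matching response code classes for each pod first, so that
    # avg divides by the number of pods rather than pods times classes.
    return (
        "avg(sum by(kubernetes_pod_name)("
        f"rate(envoy_cluster_upstream_rq_xx{selector}[1m])))"
    )


queries = {
    "main_success_rate": rate_query('!="tuning"', "2|3"),
    "tuning_success_rate": rate_query('="tuning"', "2|3"),
    "main_error_rate": rate_query('!="tuning"', "4|5"),
    "tuning_error_rate": rate_query('="tuning"', "4|5"),
}

for name, query in queries.items():
    print(f"{name}: {query}")
```

Each resulting string could then be passed as the `query` argument of the corresponding `PrometheusMetric` entry.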

rstarmer

This is great, and was likely much needed!

Fix success/error rate metrics

693139a

- Add "sum by(kubernetes_pod_name)" to success and error rate metrics so they are averaged by the number of pods instead of averaged by the number of pods times the number of "envoy_response_code_class"es

linkous8 requested review from rstarmer, blakewatters and 30blows November 2, 2021 18:55

rstarmer approved these changes Nov 10, 2021

Merge branch 'main' into fred/eng-549-opsani-devprometheus-success_rate-and

a8b9d0a

linkous8 merged commit 81cba8d into main Nov 17, 2021

linkous8 deleted the fred/eng-549-opsani-devprometheus-success_rate-and branch November 17, 2021 16:30
