Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix success/error rate metrics #357

Merged
merged 2 commits into from
Nov 17, 2021

Conversation

linkous8
Copy link
Contributor

@linkous8 linkous8 commented Nov 2, 2021

  • Add "sum by(kubernetes_pod_name)" to success and error rate metrics so
    they are averaged by the number of pods instead of averaged by the
    number of pods times the number of "envoy_response_code_class"es

- Add "sum by(kubernetes_pod_name)" to success and error rate metrics so
they are averaged by the number of pods instead of averaged by the
number of pods times the number of "envoy_response_code_class"es
@linear
Copy link

linear bot commented Nov 2, 2021

ENG-549 Opsani Dev/Prometheus; success_rate and error_rate queries have incorrect averaging logic

Currently, the Opsani Dev connector configures success_rate and error_rate queries like such:

avg(rate(envoy_cluster_upstream_rq_xx{opsani_role!="tuning", envoy_response_code_class=~"2|3"}[1m]))
avg(rate(envoy_cluster_upstream_rq_xx{opsani_role!="tuning", envoy_response_code_class=~"4|5"}[1m]))

The expected behavior is that the avg function will provide us with the success/error rate averaged by the number of pods. However, running the query without avg demonstrates why that is not the case:

image.png

The above screenshot shows that the returned rate values are bucketed by kubernetes_pod_name AND by envoy_response_code_class which doubles the divisor used by the avg resulting in erroneous success/error rate counts when compared to total_request_rate.

The proposed solution is to wrap the rate query with sum by (kubernetes_pod_name)(…) so that the avg divisor is the number of pods as expected

Relevant metric configuration:

servo.connectors.prometheus.PrometheusMetric(
"main_success_rate",
servo.types.Unit.requests_per_second,
query='avg(rate(envoy_cluster_upstream_rq_xx{opsani_role!="tuning", envoy_response_code_class=~"2|3"}[1m]))',
),
servo.connectors.prometheus.PrometheusMetric(
"tuning_success_rate",
servo.types.Unit.requests_per_second,
query='avg(rate(envoy_cluster_upstream_rq_xx{opsani_role="tuning", envoy_response_code_class=~"2|3"}[1m]))',
absent=servo.connectors.prometheus.AbsentMetricPolicy.zero
),
servo.connectors.prometheus.PrometheusMetric(
"main_error_rate",
servo.types.Unit.requests_per_second,
query='avg(rate(envoy_cluster_upstream_rq_xx{opsani_role!="tuning", envoy_response_code_class=~"4|5"}[1m]))',
absent=servo.connectors.prometheus.AbsentMetricPolicy.zero
),
servo.connectors.prometheus.PrometheusMetric(
"tuning_error_rate",
servo.types.Unit.requests_per_second,
query='avg(rate(envoy_cluster_upstream_rq_xx{opsani_role="tuning", envoy_response_code_class=~"4|5"}[1m]))',
absent=servo.connectors.prometheus.AbsentMetricPolicy.zero
),

Copy link
Contributor

@rstarmer rstarmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, and was likley much needed!

@linkous8 linkous8 merged commit 81cba8d into main Nov 17, 2021
@linkous8 linkous8 deleted the fred/eng-549-opsani-devprometheus-success_rate-and branch November 17, 2021 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants