
Return a valid prometheus_format for multiple counters #1

Merged
merged 1 commit into main from valid-prometheus-format
Jun 22, 2021

Conversation

gerhard
Contributor

@gerhard gerhard commented Jun 21, 2021

When we use the same counters for different categories (a.k.a. names), e.g. the protocols AMQP091 & STREAM, prometheus_format was returning an invalid exposition format, e.g.:

    # TYPE rabbitmq_global_messages_received_total counter
    # HELP rabbitmq_global_messages_received_total Total number of messages received from publishers
    rabbitmq_global_messages_received_total{protocol="amqp091"} 0
    # TYPE rabbitmq_global_messages_received_total counter
    # HELP rabbitmq_global_messages_received_total Total number of messages received from publishers
    rabbitmq_global_messages_received_total{protocol="stream"} 0

Instead of:

    # TYPE rabbitmq_global_messages_received_total counter
    # HELP rabbitmq_global_messages_received_total Total number of messages received from publishers
    rabbitmq_global_messages_received_total{protocol="amqp091"} 0
    rabbitmq_global_messages_received_total{protocol="stream"} 0

The previous version was invalid because `# TYPE` and `# HELP` metadata must appear at most once per metric name, as captured here: prometheus/OpenMetrics#13
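As an illustrative sketch (in Python, not the actual Erlang implementation in seshat; the tuple layout is an assumption for the example), deduplicating the `# TYPE` / `# HELP` headers amounts to grouping samples by metric name before rendering:

```python
from collections import defaultdict

def prometheus_format(samples):
    """samples: iterable of (name, help_text, metric_type, labels, value) tuples."""
    grouped = defaultdict(list)
    meta = {}
    for name, help_text, metric_type, labels, value in samples:
        meta[name] = (help_text, metric_type)
        grouped[name].append((labels, value))
    lines = []
    for name, series in grouped.items():
        help_text, metric_type = meta[name]
        # Emit the metadata exactly once per metric name...
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"# HELP {name} {help_text}")
        # ...followed by every labelled sample of that metric.
        for labels, value in series:
            label_str = ",".join(f'{key}="{val}"' for key, val in labels.items())
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)
```

With two samples sharing a name but differing in the `protocol` label, this yields a single header pair followed by both sample lines, matching the expected output above.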

This change fixes that, makes test failures a lot easier to understand, and also speeds them up by 4x:

    time make tests (before) -> 5.240s
    time make tests (after)  -> 1.292s

There is no reason to use Common Test (ct) for integration tests when EUnit tests will do the job. While we don't have many tests today, the more we add, the slower ct gets, while EUnit remains fast.

Pair @dumbbell

Signed-off-by: Gerhard Lazu <[email protected]>
@gerhard gerhard requested review from dcorbacho and kjnilsson June 21, 2021 16:30
@gerhard
Contributor Author

gerhard commented Jun 21, 2021

We may want to add some tests via GitHub Actions. For now:

make tests

 GEN    test-dir
 DEPEND seshat.d
 ERLC   seshat_counters.erl
 APP    seshat
 GEN    test-build
 GEN    eunit
.=INFO REPORT==== 21-Jun-2021::17:31:43.389447 ===
    application: seshat
    exited: stopped
    type: temporary

.=INFO REPORT==== 21-Jun-2021::17:31:43.398432 ===
    application: seshat
    exited: stopped
    type: temporary

.=INFO REPORT==== 21-Jun-2021::17:31:43.402479 ===
    application: seshat
    exited: stopped
    type: temporary

.=INFO REPORT==== 21-Jun-2021::17:31:43.405492 ===
    application: seshat
    exited: stopped
    type: temporary

.=INFO REPORT==== 21-Jun-2021::17:31:43.409498 ===
    application: seshat
    exited: stopped
    type: temporary



Top 5 slowest tests (0.001 seconds, 1.3% of total time):
  seshat_counters_server_test:fun.delete_table/0
    0.001 seconds
  seshat_counters_test:fun.overview/0
    0.000 seconds
  seshat_counters_server_test:fun.get_table/0
    0.000 seconds
  seshat_counters_test:fun.prometheus_format_multiple_names/0
    0.000 seconds
  seshat_counters_server_test:fun.get_tables/0
    0.000 seconds

Finished in 0.075 seconds
5 tests, 0 failures

gerhard added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 21, 2021
This way we can show how many messages were received via a certain protocol (stream is the second real protocol besides the default amqp091 one), as well as by queue type, which is something many have asked for, for a really long time.

The most important aspect is that we can also see them by protocol AND queue_type, which becomes very important for streams, which have different rules from regular queues (e.g. consuming messages is non-destructive, and deep queue backlogs - think billions of messages - are normal). Alerting and consumer scaling due to deep backlogs will now work correctly, as we can distinguish between regular queues & streams.
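For example, the combined labels might look like this in the exposition format (the label names and values here are illustrative assumptions, not verbatim output):

```
rabbitmq_global_messages_received_total{protocol="amqp091",queue_type="classic"} 3
rabbitmq_global_messages_received_total{protocol="stream",queue_type="stream"} 7
```

Because each sample carries both labels, queries can aggregate by either dimension, or filter on the protocol/queue_type combination directly.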

This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
#3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through. We continued working on this with @dumbbell, and
the most important changes are captured in
rabbitmq/seshat#1.

We expose these global counters in rabbitmq_prometheus via a new collector. We don't want to keep modifying the existing collector, which grew really complex in parts, especially since we introduced aggregation; instead we start with a new namespace, `rabbitmq_global_`, and continue building on top of it. The idea is to build in parallel and slowly transition to the new metrics, because the semantic changes since streams are too big, and we have been discussing protocol-specific metrics with @kjnilsson, which makes me think this approach is the least disruptive and... simple.

While at it, we removed redundant empty-return-value handling in the channel; the function being called no longer returns such a value.

Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.

Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)

Signed-off-by: Gerhard Lazu <[email protected]>
@gerhard gerhard merged commit fc324d1 into main Jun 22, 2021
@gerhard gerhard deleted the valid-prometheus-format branch June 22, 2021 12:49
@gerhard
Contributor Author

gerhard commented Jun 22, 2021

Thanks @dcorbacho !

gerhard added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 22, 2021

gerhard added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 22, 2021

gerhard added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 22, 2021
(cherry picked from commit c797125)