
Return a valid prometheus_format for multiple counters #1

Merged
merged 1 commit into main from valid-prometheus-format
Jun 22, 2021

Conversation

gerhard
Contributor

@gerhard gerhard commented Jun 21, 2021

When we use the same counters for different categories (a.k.a. names), e.g. the protocols AMQP091 & STREAM, prometheus_format was returning an invalid exposition format, e.g.:

    # TYPE rabbitmq_global_messages_received_total counter
    # HELP rabbitmq_global_messages_received_total Total number of messages received from publishers
    rabbitmq_global_messages_received_total{protocol="amqp091"} 0
    # TYPE rabbitmq_global_messages_received_total counter
    # HELP rabbitmq_global_messages_received_total Total number of messages received from publishers
    rabbitmq_global_messages_received_total{protocol="stream"} 0

Instead of:

    # TYPE rabbitmq_global_messages_received_total counter
    # HELP rabbitmq_global_messages_received_total Total number of messages received from publishers
    rabbitmq_global_messages_received_total{protocol="amqp091"} 0
    rabbitmq_global_messages_received_total{protocol="stream"} 0

The previous version was invalid because `# TYPE` and `# HELP` metadata must appear at most once per metric name, as captured here: prometheus/OpenMetrics#13
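As an illustrative sketch (in Python, not the actual Erlang implementation in seshat; the tuple layout is an assumption for the example), deduplicating the `# TYPE` / `# HELP` headers amounts to grouping samples by metric name before rendering:

```python
from collections import defaultdict

def prometheus_format(samples):
    """samples: iterable of (name, help_text, metric_type, labels, value) tuples."""
    grouped = defaultdict(list)
    meta = {}
    for name, help_text, metric_type, labels, value in samples:
        meta[name] = (help_text, metric_type)
        grouped[name].append((labels, value))
    lines = []
    for name, series in grouped.items():
        help_text, metric_type = meta[name]
        # Emit the metadata exactly once per metric name...
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"# HELP {name} {help_text}")
        # ...followed by every labelled sample of that metric.
        for labels, value in series:
            label_str = ",".join(f'{key}="{val}"' for key, val in labels.items())
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)
```

With two samples sharing a name but differing in the `protocol` label, this yields a single header pair followed by both sample lines, matching the expected output above.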

This change fixes that, makes test failures a lot easier to understand, and also speeds them up by 4x:

    time make tests (before) -> 5.240s
    time make tests (after)  -> 1.292s

There is no reason to use Common Test (ct) for integration tests when EUnit tests will do the job. While we don't have many tests today, the more we add, the slower ct gets, while EUnit remains fast.

Pair @dumbbell

Signed-off-by: Gerhard Lazu <[email protected]>
@gerhard gerhard requested review from dcorbacho and kjnilsson June 21, 2021 16:30
@gerhard
Contributor Author

gerhard commented Jun 21, 2021

We may want to add some tests via GitHub Actions. For now:

make tests

 GEN    test-dir
 DEPEND seshat.d
 ERLC   seshat_counters.erl
 APP    seshat
 GEN    test-build
 GEN    eunit
.=INFO REPORT==== 21-Jun-2021::17:31:43.389447 ===
    application: seshat
    exited: stopped
    type: temporary

.=INFO REPORT==== 21-Jun-2021::17:31:43.398432 ===
    application: seshat
    exited: stopped
    type: temporary

.=INFO REPORT==== 21-Jun-2021::17:31:43.402479 ===
    application: seshat
    exited: stopped
    type: temporary

.=INFO REPORT==== 21-Jun-2021::17:31:43.405492 ===
    application: seshat
    exited: stopped
    type: temporary

.=INFO REPORT==== 21-Jun-2021::17:31:43.409498 ===
    application: seshat
    exited: stopped
    type: temporary



Top 5 slowest tests (0.001 seconds, 1.3% of total time):
  seshat_counters_server_test:fun.delete_table/0
    0.001 seconds
  seshat_counters_test:fun.overview/0
    0.000 seconds
  seshat_counters_server_test:fun.get_table/0
    0.000 seconds
  seshat_counters_test:fun.prometheus_format_multiple_names/0
    0.000 seconds
  seshat_counters_server_test:fun.get_tables/0
    0.000 seconds

Finished in 0.075 seconds
5 tests, 0 failures

gerhard added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 21, 2021
This way we can show how many messages were received via a certain protocol (stream is the second real protocol besides the default amqp091 one), as well as by queue type, which is something many have asked for, for a really long time.

The most important aspect is that we can also see them by protocol AND queue_type, which becomes very important for streams, which have different rules from regular queues (e.g. consuming messages is non-destructive, and deep queue backlogs - think billions of messages - are normal). Alerting and consumer scaling due to deep backlogs will now work correctly, as we can distinguish between regular queues & streams.
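For example, the combined labels might look like this in the exposition format (the label names and values here are illustrative assumptions, not verbatim output):

```
rabbitmq_global_messages_received_total{protocol="amqp091",queue_type="classic"} 3
rabbitmq_global_messages_received_total{protocol="stream",queue_type="stream"} 7
```

Because each sample carries both labels, queries can aggregate by either dimension, or filter on the protocol/queue_type combination directly.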

This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
#3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through. We continued working on this with @dumbbell, and
the most important changes are captured in
rabbitmq/seshat#1.

We expose these global counters in rabbitmq_prometheus via a new collector. We don't want to keep modifying the existing collector, which grew really complex in parts, especially since we introduced aggregation; instead we start with a new namespace, `rabbitmq_global_`, and continue building on top of it. The idea is to build in parallel and slowly transition to the new metrics, because the semantic changes since streams are too big, and we have been discussing protocol-specific metrics with @kjnilsson, which makes me think this approach is the least disruptive and... simple.

While at it, we removed redundant empty-return-value handling in the channel; the function being called no longer returns such a value.

Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.

Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)

Signed-off-by: Gerhard Lazu <[email protected]>
@gerhard gerhard merged commit fc324d1 into main Jun 22, 2021
@gerhard gerhard deleted the valid-prometheus-format branch June 22, 2021 12:49
@gerhard
Contributor Author

gerhard commented Jun 22, 2021

Thanks @dcorbacho !

gerhard added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 22, 2021

gerhard added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 22, 2021

gerhard added a commit to rabbitmq/rabbitmq-server that referenced this pull request Jun 22, 2021
(cherry picked from commit c797125)