Escape prometheus core metric label values #9656

Merged: 2 commits, Dec 21, 2023

gomoripeti
Contributor

Proposed Changes

For example, special characters such as double quotes are allowed in queue
names, in which case detailed metrics could produce unparsable text
format output.
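
To illustrate the problem, here is a minimal sketch of label value escaping as the Prometheus text format expects it (backslash, double quote and newline are escaped). The module and function names are hypothetical and are not the code in this PR.

```erlang
%% Illustrative sketch only: module and function names are assumptions,
%% not the actual RabbitMQ or prometheus.erl implementation.
-module(label_escape_sketch).
-export([escape_label_value/1, render_label/2]).

%% Escape the three characters that are special inside a label value
%% in the Prometheus text format: backslash, double quote, newline.
-spec escape_label_value(binary()) -> binary().
escape_label_value(Value) when is_binary(Value) ->
    << <<(escape_char(C))/binary>> || <<C>> <= Value >>.

escape_char($\\) -> <<"\\\\">>;
escape_char($")  -> <<"\\\"">>;
escape_char($\n) -> <<"\\n">>;
escape_char(C)   -> <<C>>.

%% Render a single name="value" pair with the value escaped.
-spec render_label(binary(), binary()) -> binary().
render_label(Name, Value) when is_binary(Name), is_binary(Value) ->
    <<Name/binary, "=\"", (escape_label_value(Value))/binary, "\"">>.
```

Without the escaping step, a queue named my"queue would be rendered as queue="my"queue", which a text format parser cannot read; with it the output is queue="my\"queue".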

Fixes #9648

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

Further Comments

  • I would like help with the test case. Although {collect_statistics, fine} is set in init_per_group, there are still no non-coarse metrics; for example, the table channel_exchange_metrics is empty. I must be missing something?

  • Adding escaping will probably have a slight performance impact. But I believe the main reason for using preformatted labels is to avoid the intermediate #LabelPair record representation, not to skip escaping.

@gomoripeti
Contributor Author

hm, the new test case consistently fails for me locally, but seems like it passed in the CI. Maybe I should start from a clean repo.

@michaelklishin
Member

@gomoripeti when in doubt, run

bazel clean
killall -9 erl; killall -9 epmd; killall -9 beam.smp

You may have stray nodes running in the background from the earlier (likely failing at first) runs.

@michaelklishin
Member

@tvhong-amazon can you please have a look, does this escape label values the way you'd expect? Thank you.

@gomoripeti
Contributor Author

I forgot to add the new test group to the all() test groups, which is why the CI tests did not fail previously.

I would like help with the test case. Although {collect_statistics, fine} is set in init_per_group, there are still no non-coarse metrics; for example, the table channel_exchange_metrics is empty. (This line fails: https://github.com/rabbitmq/rabbitmq-server/pull/9656/files#diff-18cc5f075e5ea0bf5171dfb8bfb339bb021ed81bf0fa779909975cc080ce7d88R608) I must be missing something?

@tvhong-amazon
Contributor

Thank you for the quick turnaround on this! The escaped labels look correct to me.

However, I see that there are other label clauses that don't call escape_label_value, such as:

label(L) when is_binary(L) ->

label(M) when is_map(M) ->

label({RemoteAddress, Username, Protocol}) when is_binary(RemoteAddress), is_binary(Username),
                                                is_atom(Protocol) ->

I think we should also escape Username as it can include special characters. As for the other functions and variables, are we positive that they don't need to be escaped?

@michaelklishin
Member

@tvhong-amazon agreed, since usernames can be generated in RabbitMQ-aaS environments. Connection names would be another obvious candidate.

@gomoripeti
Contributor Author

thank you for the review and feedback. I thoroughly reviewed the label function again

So to sum up, I think all cases, where necessary, are escaped

@tvhong-amazon
Contributor

@gomoripeti Thank you for double checking.

What you point out here is quite interesting:

a key-value list (not a preformatted binary) so it will be escaped by prometheus_text_format

If this is the case, would it make sense to change all other methods to return a key-value list instead of a binary? That way, we don't need to duplicate the label escaping logic in RabbitMQ.

@gomoripeti
Contributor Author

Pre-rendering was an intentional performance optimisation in #3587

I just noticed this note in that PR, which turned out to be a wrong assumption:

Character escaping is not needed, as AMQP object names are quite restricted in what characters they can have.

But it is true that if the escaping function were exported by the prometheus lib, we could avoid some code duplication. (I just wanted to submit a fix quickly, but that would be the proper way.)

@@ -448,6 +450,25 @@ label(A) when is_atom(A) ->
is_protocol(P) ->
    lists:member(P, [amqp091, amqp10, mqtt, http]).

%% Escape functions taken from prometheus_text_format
Member

Any reason you copied the functions instead of just calling the ones in prometheus_text_format? I think they are exported?

Contributor Author

Member

@michaelklishin Oct 11, 2023

You are welcome to contribute to prometheus.erl, it is maintained by a RabbitMQ core team member

Contributor

I was poking around and decided to create a PR in prometheus.erl to move that export out of the TEST def: deadtrickster/prometheus.erl#158

Contributor Author

thank you! will update this PR when that is merged

@binarin
Contributor

binarin commented Oct 24, 2023

The reasoning behind removing escaping was that via AMQP 0.9.1 it's not possible to create a queue, exchange, etc. whose name needs escaping, and escaping was one of the worst performance offenders. So removing it seemed reasonable. You'll need to check that performance (and memory usage) is still acceptable with escaping restored. I suggest measuring the time it takes to generate stats on a broker with 10,000 each of exchanges, queues and channels.

@gomoripeti
Contributor Author

thanks @binarin for the historical context. I assumed the intermediate record representation was the main concern.
Maybe escaping could be optimised to first scan the binary for special characters, and only copy-and-escape if any are found. This would avoid copying the binary in most cases. Will measure...
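
The scan-first idea could look roughly like the sketch below. It is only an illustration under my own assumptions: the module and function names are hypothetical, and this is not necessarily how the optimised version in prometheus.erl is written.

```erlang
%% Illustrative sketch only: names are assumptions, not the
%% prometheus.erl implementation.
-module(label_escape_scan_sketch).
-export([escape_if_needed/1]).

%% Scan the binary for any character that needs escaping.
%% If none is found, return the original binary untouched (no copy);
%% otherwise fall back to the copy-and-escape path.
-spec escape_if_needed(binary()) -> binary().
escape_if_needed(Value) when is_binary(Value) ->
    case binary:match(Value, [<<"\\">>, <<"\"">>, <<"\n">>]) of
        nomatch -> Value;
        _Found  -> escape(Value)
    end.

%% Slow path: rebuild the binary, escaping special characters.
escape(Value) ->
    << <<(escape_char(C))/binary>> || <<C>> <= Value >>.

escape_char($\\) -> <<"\\\\">>;
escape_char($")  -> <<"\\\"">>;
escape_char($\n) -> <<"\\n">>;
escape_char(C)   -> <<C>>.
```

Since most object names contain no special characters, the common case is a single scan with no allocation for a new binary.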

@gomoripeti
Contributor Author

I did some benchmarks of escaping. The baseline is RabbitMQ 3.12.6 unchanged. The "escape" version uses the unchanged escape_label_value. The "escape-opt" version is the one visible in deadtrickster/prometheus.erl#160.

I first did a quick microbenchmark of directly calling escape_label_value. Calling it 1M times with random strings of length 1..80 and no special characters took about 2 seconds. Calling the optimised version took about 0.2 seconds.
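
A microbenchmark of this kind can be sketched as follows. This is a reconstruction under assumptions, not the script that produced the numbers above: the module name, run/2 and random_label/0 are hypothetical, and the escape function under test is passed in as a fun.

```erlang
%% Illustrative sketch only: not the benchmark script used in this PR.
-module(escape_bench_sketch).
-export([run/2, random_label/0]).

%% Build a random lowercase binary of length 1..80
%% that contains no special characters.
-spec random_label() -> binary().
random_label() ->
    Len = rand:uniform(80),
    << <<(rand:uniform(26) + $a - 1)>> || _ <- lists:seq(1, Len) >>.

%% Generate N labels up front, call EscapeFun on each of them,
%% and return the elapsed wall-clock time in seconds.
-spec run(fun((binary()) -> binary()), pos_integer()) -> float().
run(EscapeFun, N) when is_function(EscapeFun, 1), is_integer(N), N > 0 ->
    Labels = [random_label() || _ <- lists:seq(1, N)],
    {Micros, ok} = timer:tc(fun() -> lists:foreach(EscapeFun, Labels) end),
    Micros / 1000000.
```

Calling escape_bench_sketch:run(EscapeFun, 1000000) once per escape function under comparison gives figures comparable in spirit to the ones quoted above. Generating the labels before starting the timer keeps the random string construction out of the measurement.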

Then I ran the following benchmark: I created 1 connection with 10K channels. Each channel created 1 exchange and 1 queue, subscribed to the queue and published 1 message to the exchange. Queue and exchange names were of the form "q_<index>", i.e. 3-7 characters long.

Object counts:
Connections: 2
Channels: 10001
Exchanges: 10020
Queues: 10001
Consumers: 10001

I used the command below to query the Prometheus metrics and measure the duration:

$ curl -s -w 'Total: %{time_total}s\n' -u ${AUTH} https://${BROKER}/metrics/per-object -o metrics.log

Number of different labels in the output:

$ grep -c 'queue=' metrics.log; grep -c 'exchange=' metrics.log; grep -c 'vhost=' metrics.log; grep -c 'channel=' metrics.log
300029
50000
340029
200022

I measured, 5 times each, the total time of the HTTP call as well as the duration of the function call prometheus_rabbitmq_core_metrics_collector:collect_mf/2.

Results

collect_mf/2 duration (sec)

version      avg                  min        max        samples
baseline     2.0873628            1.819755   2.691796   2.691796, 1.987629, 1.819755, 2.000737, 1.936897
escape       4.2683968            3.174599   5.134498   5.134498, 4.835701, 4.117033, 4.080153, 3.174599
escape-opt   2.6438805999999997   2.103326   3.519962   3.519962, 3.234765, 2.167548, 2.103326, 2.193802

HTTP duration samples

version      samples
baseline     2.887208s, 2.180847s, 1.981343s, 2.183184s, 2.081882s
escape       5.914040s, 5.069289s, 4.302007s, 4.246539s, 3.364625s
escape-opt   3.730437s, 3.402486s, 2.350659s, 2.315713s, 2.349219s

@gomoripeti gomoripeti force-pushed the prometheus_escape_label branch from 3c87f94 to c283744 Compare November 9, 2023 18:04
@mergify mergify bot added the make label Nov 9, 2023
@gomoripeti gomoripeti force-pushed the prometheus_escape_label branch 2 times, most recently from 375b12f to 2a7a2ca Compare November 15, 2023 18:12
@mergify mergify bot added the bazel label Nov 15, 2023
@gomoripeti gomoripeti force-pushed the prometheus_escape_label branch from 2a7a2ca to d236e69 Compare November 15, 2023 22:37
@deadtrickster
Contributor

tagged prometheus.erl 4.11

@gomoripeti gomoripeti force-pushed the prometheus_escape_label branch from d236e69 to 4d8e5d2 Compare November 23, 2023 14:54
@gomoripeti
Contributor Author

thanks a lot, Iliia!

@gomoripeti gomoripeti force-pushed the prometheus_escape_label branch from 4d8e5d2 to 8c78760 Compare December 3, 2023 00:14
@gomoripeti
Contributor Author

this might have been forgotten (I did forget about it)

this PR is ready for review and merge.

@michaelklishin
Member

@tvhong-amazon @SimonUnge @illotum this is a PR that seeks to address an issue your team has reported. Can you please give it a shot?

@illotum

illotum commented Dec 20, 2023

I ran several (naive) benchmarks and manual tests. LGTM.

1000 bad queue names:

~/r/s/d/r/bin (prometheus_escape_label)> hyperfine -w 100 "curl localhost:15692/metrics/per-object"
Benchmark 1: curl localhost:15692/metrics/per-object
  Time (mean ± σ):      53.6 ms ±   5.7 ms    [User: 5.5 ms, System: 9.7 ms]
  Range (min … max):    47.7 ms …  72.4 ms    44 runs

~/r/s/d/r/bin (main)> hyperfine -w 100 "curl localhost:15692/metrics/per-object"
Benchmark 1: curl localhost:15692/metrics/per-object
  Time (mean ± σ):      50.9 ms ±   3.1 ms    [User: 5.3 ms, System: 8.8 ms]
  Range (min … max):    47.4 ms …  64.5 ms    44 runs
 

@michaelklishin michaelklishin merged commit 54ae406 into rabbitmq:main Dec 21, 2023
8 checks passed
michaelklishin added a commit that referenced this pull request Dec 21, 2023
michaelklishin added a commit that referenced this pull request Dec 21, 2023
Escape prometheus core metric label values (backport #9656)
mergify bot pushed a commit that referenced this pull request Dec 21, 2023
(cherry picked from commit 8224673)
michaelklishin added a commit that referenced this pull request Dec 21, 2023
Escape prometheus core metric label values (backport #9656) (backport #10195)
@gomoripeti gomoripeti deleted the prometheus_escape_label branch February 12, 2024 12:08
Successfully merging this pull request may close these issues.

Prometheus label values are not escaped correctly
8 participants