Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telemetry documentation: Lease Expiry Metrics #10377

Merged
merged 5 commits into from
Nov 23, 2020
Merged

Conversation

HridoyRoy
Copy link
Contributor

Related to: #10375

@HridoyRoy HridoyRoy requested a review from mgritter November 11, 2020 01:18
@@ -104,6 +104,7 @@ These metrics represent operational aspects of the running Vault instance.
| `vault.core.step_down` | Duration of time taken by cluster leadership step downs. This should be monitored and alerted on for overall cluster leadership status. | ms | summary |
| `vault.core.unseal` | Duration of time taken by unseal operations | ms | summary |
| `vault.core.unsealed` | Has value 1 when Vault is unsealed, and 0 when Vault is sealed. | bool | gauge |
| `vault.expire.leases.by_expiration` | Number of leases set to expire, grouped by a time interval. This time interval and total number of time intervals are configurable via `lease_metrics_epsilon` and `num_lease_metrics_buckets` in the telemetry stanza of a vault server configuration. One can additionally group lease expiration by namespace by setting `add_lease_metrics_namespace_labels` to `true` in the config file. | objects | gauge |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple suggestions here:

  • list the label names here (cluster, whatever you called the bucket) like the other labelled metrics.
  • specify the default interval, since that it is what most people will be dealing with
  • link to the telemetry stanza documentation? (Should that be part of this PR?)

There should be a "leases" section, with the other vault.expire metrics, please put it there.

Copy link
Contributor

@mgritter mgritter Nov 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"object" can be "leases" to match vault.expire.num_leases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the new change look?

Comment on lines 40 to 41
- `lease_metrics_epsilon` `(string: "1hr")` - Specifies the time difference between collections
of leases by expiry time. For example, for the default value of 1 hour, the `vault.expire.leases.by_expiration`
Copy link
Contributor

@mgritter mgritter Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, "time difference between collections of leases" makes it sounds like it is the interval at which the collection is done, not the interval of the underlying data.

How about something like "the size of the bucket used to measure future lease expiration?" Or "the granularity of measurement for lease expiration"?

Comment on lines 44 to 45
lease A expires 25 minutes from now, and lease B expires 35 minutes from now, then lease A will be in a
different bucket from lease B.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not notice this, and it's somewhat surprising to me. Maybe you could be more explicit in your example and say something like:

For the default 1 hour buckets, the first bucket consists of leases expiring between 0 and 29 minutes in the future, the second bucket consists of leases expiring between 30 and 99 minutes in the future, etc.

Rather than just stating that they are "different buckets".

(Alternatively, it's not too late to reconsider and round down so that bucket 0 is "0-59 minutes", bucket 1 is "60-119 minutes", etc.)

Copy link
Contributor Author

@HridoyRoy HridoyRoy Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to do floors instead of rounds (so 0-59 --> bucket 0, 60-119 --> bucket 1, etc.), but I feel rounding would be more descriptive. For example, if 100 leases expire at minute 55 and 100 at minute 65, I as a customer would probably want the metric to report 200 in bucket 1, and 0 in bucket 0. But I can see it going either way.

Definitely, can rewrite the example to be more explicit!

different bucket from lease B.
- `num_lease_metrics_buckets` `(int: 168)` - The number of expiry buckets for leases. For the default value for
example, 168 value labels for the `vault.expire.leases.by_expiration` metric will be reported, where each value
has a time difference from the previous value that is specified by the `lease_metrics_epsilon` parameter. For the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"has a time different from the previous value that is specified by the lease_metrics_epsilon parameter"

=>

"each bucket is separated in time by the lease_metrics_epsilon parameter"?

- `add_lease_metrics_namespace_labels` `(bool: false)` - If this value is set to true, then `vault.expire.leases.by_expiration` will break down
expiring leases by both time and namespace. For example, if x labels are expiring within one time bucket T specified by `lease_metrics_epsilon`
and 40% of those leases are in namespace a and 60% of those leases are in namespace b, then having this value set to true will yield two
different label sets -- one with value .4x with labels T and a, and another with value .6x and labels T and b.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I found the switch from 40% to .4x confusing and had to re-read.

I think the example is probably not necessary; what I'd prefer to say is that this is disabled by default because enabling it can lead to a large-cardinality metric.

@@ -151,7 +150,8 @@ These metrics cover measurement of token, identity, and lease operations, and co
| `vault.expire.fetch-lease-times` | Time taken to fetch lease times | ms | summary |
| `vault.expire.fetch-lease-times-by-token` | Time taken to fetch lease times by token | ms | summary |
| `vault.expire.num_leases` | Number of all leases which are eligible for eventual expiry | leases | gauge |
| `vault.expire.lease_expiration | Count of lease expirations | leases | counter |
| `vault.expire.leases.by_expiration` (cluster,gauge,expiring,namespace) | Number of leases set to expire, grouped by a time interval. This time interval and total number of time intervals are configurable via `lease_metrics_epsilon` and `num_lease_metrics_buckets` in the telemetry stanza of a vault server configuration. The default values for these are `1hr` and `168` respectively, so the metric will report the number of leases that will expire each hour from the current time to a week from the current time. One can additionally group lease expiration by namespace by setting `add_lease_metrics_namespace_labels` to `true` in the config file (default is `false`). | leases | gauge |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! :)

Comment on lines +47 to +51
- `num_lease_metrics_buckets` `(int: 168)` - The number of expiry buckets for leases. For the default value, for
example, 168 value labels for the `vault.expire.leases.by_expiration` metric will be reported, where each value
each bucket is separated in time by the `lease_metrics_epsilon` parameter. For the default 1 hour value of
`lease_metrics_epsilon` and the default value of `num_lease_metrics_buckets`, `vault.expire.leases.by_expiration`
will report the total number of leases expiring within each hour from the current time to one week from the current time.
Copy link
Contributor

@mgritter mgritter Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like "For the default value, for example, 168 value labels..." is awkward.

Suggested change
- `num_lease_metrics_buckets` `(int: 168)` - The number of expiry buckets for leases. For the default value, for
example, 168 value labels for the `vault.expire.leases.by_expiration` metric will be reported, where each value
each bucket is separated in time by the `lease_metrics_epsilon` parameter. For the default 1 hour value of
`lease_metrics_epsilon` and the default value of `num_lease_metrics_buckets`, `vault.expire.leases.by_expiration`
will report the total number of leases expiring within each hour from the current time to one week from the current time.
- `num_lease_metrics_buckets` `(int: 168)` - Controls the number of buckets reported by the
`vault.expire.leases.by_expiration` metric, where each bucket is separated in time by the
`lease_metrics_epsilon` parameter. For the default 1 hour value of `lease_metrics_epsilon`
and the default value of `num_lease_metrics_buckets`, `vault.expire.leases.by_expiration` will report the total
number of leases expiring within each hour from the current time to one week from the current time.

@HridoyRoy HridoyRoy merged commit c101a91 into master Nov 23, 2020
@HridoyRoy HridoyRoy deleted the lease-exp-telemetry branch November 23, 2020 19:06
kalafut pushed a commit that referenced this pull request Jan 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants