-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Telemetry documentation: Lease Expiry Metrics #10377
Conversation
@@ -104,6 +104,7 @@ These metrics represent operational aspects of the running Vault instance. | |||
| `vault.core.step_down` | Duration of time taken by cluster leadership step downs. This should be monitored and alerted on for overall cluster leadership status. | ms | summary | | |||
| `vault.core.unseal` | Duration of time taken by unseal operations | ms | summary | | |||
| `vault.core.unsealed` | Has value 1 when Vault is unsealed, and 0 when Vault is sealed. | bool | gauge | | |||
| `vault.expire.leases.by_expiration` | Number of leases set to expire, grouped by a time interval. This time interval and total number of time intervals are configurable via `lease_metrics_epsilon` and `num_lease_metrics_buckets` in the telemetry stanza of a vault server configuration. One can additionally group lease expiration by namespace by setting `add_lease_metrics_namespace_labels` to `true` in the config file. | objects | gauge | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple suggestions here:
- list the label names here (cluster, whatever you called the bucket) like the other labelled metrics.
- specify the default interval, since that it is what most people will be dealing with
- link to the telemetry stanza documentation? (Should that be part of this PR?)
There should be a "leases" section, with the other vault.expire metrics, please put it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"object" can be "leases" to match vault.expire.num_leases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the new change look?
- `lease_metrics_epsilon` `(string: "1hr")` - Specifies the time difference between collections | ||
of leases by expiry time. For example, for the default value of 1 hour, the `vault.expire.leases.by_expiration` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, "time difference between collections of leases" makes it sounds like it is the interval at which the collection is done, not the interval of the underlying data.
How about something like "the size of the bucket used to measure future lease expiration?" Or "the granularity of measurement for lease expiration"?
lease A expires 25 minutes from now, and lease B expires 35 minutes from now, then lease A will be in a | ||
different bucket from lease B. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not notice this, and it's somewhat surprising to me. Maybe you could be more explicit in your example and say something like:
For the default 1 hour buckets, the first bucket consists of leases expiring between 0 and 29 minutes in the future, the second bucket consists of leases expiring between 30 and 99 minutes in the future, etc.
Rather than just stating that they are "different buckets".
(Alternatively, it's not too late to reconsider and round down so that bucket 0 is "0-59 minutes", bucket 1 is "60-119 minutes", etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to do floors instead of rounds (so 0-59 --> bucket 0, 60-119 --> bucket 1, etc.), but I feel rounding would be more descriptive. For example, if 100 leases expire at minute 55 and 100 at minute 65, I as a customer would probably want the metric to report 200 in bucket 1, and 0 in bucket 0. But I can see it going either way.
Definitely, can rewrite the example to be more explicit!
different bucket from lease B. | ||
- `num_lease_metrics_buckets` `(int: 168)` - The number of expiry buckets for leases. For the default value for | ||
example, 168 value labels for the `vault.expire.leases.by_expiration` metric will be reported, where each value | ||
has a time difference from the previous value that is specified by the `lease_metrics_epsilon` parameter. For the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"has a time different from the previous value that is specified by the lease_metrics_epsilon parameter"
=>
"each bucket is separated in time by the lease_metrics_epsilon parameter"?
- `add_lease_metrics_namespace_labels` `(bool: false)` - If this value is set to true, then `vault.expire.leases.by_expiration` will break down | ||
expiring leases by both time and namespace. For example, if x labels are expiring within one time bucket T specified by `lease_metrics_epsilon` | ||
and 40% of those leases are in namespace a and 60% of those leases are in namespace b, then having this value set to true will yield two | ||
different label sets -- one with value .4x with labels T and a, and another with value .6x and labels T and b. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I found the switch from 40% to .4x confusing and had to re-read.
I think the example is probably not necessary; what I'd prefer to say is that this is disabled by default because enabling it can lead to a large-cardinality metric.
@@ -151,7 +150,8 @@ These metrics cover measurement of token, identity, and lease operations, and co | |||
| `vault.expire.fetch-lease-times` | Time taken to fetch lease times | ms | summary | | |||
| `vault.expire.fetch-lease-times-by-token` | Time taken to fetch lease times by token | ms | summary | | |||
| `vault.expire.num_leases` | Number of all leases which are eligible for eventual expiry | leases | gauge | | |||
| `vault.expire.lease_expiration | Count of lease expirations | leases | counter | | |||
| `vault.expire.leases.by_expiration` (cluster,gauge,expiring,namespace) | Number of leases set to expire, grouped by a time interval. This time interval and total number of time intervals are configurable via `lease_metrics_epsilon` and `num_lease_metrics_buckets` in the telemetry stanza of a vault server configuration. The default values for these are `1hr` and `168` respectively, so the metric will report the number of leases that will expire each hour from the current time to a week from the current time. One can additionally group lease expiration by namespace by setting `add_lease_metrics_namespace_labels` to `true` in the config file (default is `false`). | leases | gauge | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! :)
- `num_lease_metrics_buckets` `(int: 168)` - The number of expiry buckets for leases. For the default value, for | ||
example, 168 value labels for the `vault.expire.leases.by_expiration` metric will be reported, where each value | ||
each bucket is separated in time by the `lease_metrics_epsilon` parameter. For the default 1 hour value of | ||
`lease_metrics_epsilon` and the default value of `num_lease_metrics_buckets`, `vault.expire.leases.by_expiration` | ||
will report the total number of leases expiring within each hour from the current time to one week from the current time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like "For the default value, for example, 168 value labels..." is awkward.
- `num_lease_metrics_buckets` `(int: 168)` - The number of expiry buckets for leases. For the default value, for | |
example, 168 value labels for the `vault.expire.leases.by_expiration` metric will be reported, where each value | |
each bucket is separated in time by the `lease_metrics_epsilon` parameter. For the default 1 hour value of | |
`lease_metrics_epsilon` and the default value of `num_lease_metrics_buckets`, `vault.expire.leases.by_expiration` | |
will report the total number of leases expiring within each hour from the current time to one week from the current time. | |
- `num_lease_metrics_buckets` `(int: 168)` - Controls the number of buckets reported by the | |
`vault.expire.leases.by_expiration` metric, where each bucket is separated in time by the | |
`lease_metrics_epsilon` parameter. For the default 1 hour value of `lease_metrics_epsilon` | |
and the default value of `num_lease_metrics_buckets`, `vault.expire.leases.by_expiration` will report the total | |
number of leases expiring within each hour from the current time to one week from the current time. |
Related to: #10375