Specify how to stop and restart metrics reporting #2711

Open
jmacd opened this issue Aug 4, 2022 · 2 comments
Labels: triaged-needmoreinfo (label deprecated; the issue is triaged, the OTel community needs more information to decide), spec:metrics (related to the specification/metrics directory)

Comments

@jmacd
Contributor

jmacd commented Aug 4, 2022

What are you trying to achieve?

Trying to address open-telemetry/opentelemetry-js#2997 and #1891.

There are situations when a metrics SDK wants to stop reporting data for a particular instrument and attribute set. This comes about differently for asynchronous/synchronous instruments, depending on cardinality choice.

The situation may arise in every combination of sync/async and delta/cumulative. Safely stopping metrics reporting requires attention to what information is lost to the consumer, especially where the loss may lead to inaccurate rate calculations.

For example:

  • synchronous/cumulative: cumulative temporality implies long-term memory use, so the SDK needs to stop reporting when memory use grows too large; the question is what start time to use when restarting, and what that says about the rate in the interim period
  • synchronous/delta: this case is always safe. SDKs are free to stop reporting an instrument/attribute pair when it has not been used during a collection cycle; this leaves a gap, which is considered normal for delta temporality
  • asynchronous/cumulative: it is safe to stop reporting the instrument/attribute pair, since the user is free to simply not observe the attributes. This leaves a gap in the record, which is not considered good practice for cumulative reporting
  • asynchronous/delta: it is safe to stop reporting the instrument/attribute pair, but restarting the same instrument/attribute pair is complicated for the same reason as synchronous/cumulative.
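
To make the "always safe" synchronous/delta case concrete, here is a minimal sketch in Python. It is a toy model, not any SDK's API: the class `DeltaSum` and its methods are hypothetical names chosen for illustration. It only shows that a pair unused during a collection cycle drops out of the output and holds no memory.

```python
from collections import defaultdict


class DeltaSum:
    """Toy synchronous sum with delta temporality (hypothetical, not an SDK API)."""

    def __init__(self):
        self._pending = defaultdict(float)

    def add(self, value, attributes):
        # Attributes are stored as a sorted tuple so they can be used as a dict key.
        self._pending[tuple(sorted(attributes.items()))] += value

    def collect(self):
        # Hand back everything recorded since the last collection and forget it.
        # Attribute sets that were not used in this cycle simply do not appear.
        points, self._pending = dict(self._pending), defaultdict(float)
        return points


counter = DeltaSum()
counter.add(1, {"route": "/a"})
counter.add(2, {"route": "/b"})
first = counter.collect()
counter.add(5, {"route": "/a"})
second = counter.collect()  # "/b" is absent: a gap, and no state is kept for it
```

The cumulative cases cannot be this simple, because forgetting a pair also forgets its running total and start time.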

What did you expect to see?

This was discussed in the 8/3 Prometheus-WG SIG meeting. One idea was to use the NO_DATA_PRESENT staleness marker as a way to communicate the stop to the consumer. There appears to be some benefit to issuing NO_DATA_PRESENT data points for a period of time before the SDK is allowed to forget the value and erase it from memory.

Informally, I think we expect that if the same instrument/attributes pair is re-used immediately, the new start time assigned will be no earlier than the last NO_DATA_PRESENT data point that was written. Ideally, the new start time assigned will be no later than the previous collection timestamp.
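
One possible reading of the two start-time bounds, sketched in Python for a single synchronous/cumulative series. Everything here is an assumption made for illustration: the names `Point`, `CumulativeSeries`, `stop`, and the `stale_cycles` knob are hypothetical, the `no_data` field merely stands in for the staleness marker, and integer timestamps replace real clocks. The sketch restarts at the previous collection timestamp, which can never be earlier than the last staleness marker because markers are only written during a collection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Point:
    start_time: int
    time: int
    value: Optional[float]
    no_data: bool = False  # stands in for the staleness marker


class CumulativeSeries:
    """One instrument/attribute pair with cumulative temporality (toy model)."""

    def __init__(self, created_at: int, stale_cycles: int = 2):
        self.stale_cycles = stale_cycles  # hypothetical knob: markers to emit after a stop
        self.last_collection = created_at
        self.start_time = created_at
        self.active = False
        self.total = 0.0
        self.stale_left = 0
        self.last_stale: Optional[int] = None

    def add(self, value: float) -> None:
        if not self.active:
            # (Re)start at the previous collection timestamp. It is never earlier
            # than the last staleness marker, so both bounds are satisfied.
            self.start_time = self.last_collection
            self.total = 0.0
            self.stale_left = 0
            self.active = True
        self.total += value

    def stop(self) -> None:
        """Stop reporting, for example under memory pressure."""
        self.active = False
        self.stale_left = self.stale_cycles

    def collect(self, now: int) -> Optional[Point]:
        point = None
        if self.active:
            point = Point(self.start_time, now, self.total)
        elif self.stale_left > 0:
            # Emit a staleness marker for a few cycles before going silent.
            self.stale_left -= 1
            self.last_stale = now
            point = Point(self.start_time, now, None, no_data=True)
        self.last_collection = now
        return point


series = CumulativeSeries(created_at=0)
series.add(3)
p1 = series.collect(10)  # normal cumulative point, start_time 0
series.stop()
p2 = series.collect(20)  # staleness marker written at time 20
series.add(4)            # re-used immediately: restarts with start_time 20
p3 = series.collect(30)
```

What the consumer should infer about the rate between `p1` and `p3` is exactly the open question; the sketch only fixes the timestamps.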

jmacd added the spec:metrics label Aug 4, 2022
@jack-berg
Member

> asynchronous/delta: in this case, it is safe to stop reporting the instrument/attribute pair, but restarting the same instrument/attribute pair is complicated for the same reason as synchronous/cumulative.

In the past we've talked about asynchronous instruments being responsible for managing time series. That is, if a callback stops reporting a particular series, it's OK for the SDK to forget it and stop reporting it in both the cumulative and delta cases. If the callback later starts reporting the series again, the SDK starts reporting the values again, but delta aggregations don't need to report the diff between the latest value and the last reported value (before the reporting stopped): the initial reported delta is the first recorded value.

If you buy this, then it's up to the callback author to understand their role in time series management, and to understand the semantic meaning of no longer reporting a series.
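
A minimal Python sketch of the behavior described above, assuming a toy converter from cumulative callback observations to deltas. `AsyncDeltaConverter` is a hypothetical name, not an SDK class, and attribute sets are reduced to plain string keys.

```python
class AsyncDeltaConverter:
    """Turns cumulative callback observations into deltas (toy model)."""

    def __init__(self):
        self._last = {}  # series key -> last observed cumulative value

    def collect(self, observations):
        """observations: dict of series key -> cumulative value from the callback."""
        deltas = {}
        for key, value in observations.items():
            # A series with no remembered value reports its first value as the delta.
            deltas[key] = value - self._last.get(key, 0)
        # Forget every series the callback did not report in this cycle.
        self._last = dict(observations)
        return deltas


conv = AsyncDeltaConverter()
a = conv.collect({"x": 10, "y": 5})
b = conv.collect({"x": 12})          # "y" is not observed, so it is forgotten
c = conv.collect({"x": 15, "y": 9})  # "y" returns: its delta is 9, not 9 - 5
```

The last call is the point of the sketch: once `y` has been forgotten, its first value after the gap is reported as-is, so the callback author decides what a disappearance means.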

@aabmass
Member

aabmass commented Jun 29, 2023

Looking at the new cardinality limits in the spec, specifically for synchronous instruments, it says:

> Views of synchronous instruments with cumulative aggregation temporality MUST continue to export all attribute sets that were observed prior to the beginning of overflow. Metric events corresponding with attribute sets that were not observed prior to the overflow will be reflected in a single data point described by (only) the overflow attribute.

In other words, always keep the oldest streams and collapse any new ones into the overflow. However, there are high-cardinality use cases where you may never see the old streams again and you want to free up that memory for something new.

This is a MUST requirement in the spec right now. I think we need to discuss this issue before marking the cardinality limits section stable.
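
For reference, the "keep the oldest, collapse the rest" behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not the SDK algorithm: `LimitedCumulativeSum` is a hypothetical name, the overflow point is keyed by the attribute `otel.metric.overflow` set to true, and one slot of the limit is assumed to be reserved for it.

```python
OVERFLOW_KEY = (("otel.metric.overflow", True),)


class LimitedCumulativeSum:
    """Toy cumulative sum with a cardinality limit (hypothetical, not an SDK API)."""

    def __init__(self, cardinality_limit):
        self.limit = cardinality_limit
        self.totals = {}  # attribute key -> cumulative total; entries are never evicted

    def add(self, value, attributes):
        key = tuple(sorted(attributes.items()))
        # Assumption: one slot is reserved for the overflow point, so only
        # limit - 1 distinct attribute sets are ever tracked individually.
        if key not in self.totals and len(self.totals) >= self.limit - 1:
            key = OVERFLOW_KEY
        self.totals[key] = self.totals.get(key, 0) + value


metric = LimitedCumulativeSum(cardinality_limit=3)
for route in ["/a", "/b", "/c", "/d", "/a"]:
    metric.add(1, {"route": route})
# "/a" and "/b" arrived first and keep their own streams indefinitely;
# "/c" and "/d" are folded into the single overflow point.
```

Because nothing is ever evicted from `totals`, a stream that is never seen again still occupies its slot, which is the memory concern raised here.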

5 participants