Skip to content

Commit

Permalink
Attribute sets not observed during async callbacks are not exported (#…
Browse files Browse the repository at this point in the history
  • Loading branch information
dashpole authored Aug 10, 2023
1 parent fce0a69 commit db9422d
Show file tree
Hide file tree
Showing 3 changed files with 82 additions and 58 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@ release.
([#3648](https://github.com/open-telemetry/opentelemetry-specification/pull/3648))
- MetricReader.Collect ignores Resource from MetricProducer.Produce.
([#3636](https://github.com/open-telemetry/opentelemetry-specification/pull/3636))
- Attribute sets not observed during async callbacks are not exported.
([#3242](https://github.com/open-telemetry/opentelemetry-specification/pull/3242))

### Logs

Expand Down
4 changes: 4 additions & 0 deletions specification/metrics/sdk.md
Original file line number Diff line number Diff line change
Expand Up @@ -701,6 +701,10 @@ execution.
The implementation MUST complete the execution of all callbacks for a
given instrument before starting a subsequent round of collection.

The implementation SHOULD NOT produce aggregated metric data for a
previously-observed attribute set which is not observed during a successful
callback.

### Cardinality limits

**Status**: [Experimental](../document-status.md)
Expand Down
134 changes: 76 additions & 58 deletions specification/metrics/supplementary-guidelines.md
Original file line number Diff line number Diff line change
Expand Up @@ -431,25 +431,30 @@ Instrument](./api.md#histogram). What if we collect measurements from an
[Asynchronous Counter](./api.md#asynchronous-counter)?

The following example shows the number of [page
faults](https://en.wikipedia.org/wiki/Page_fault) of each thread since the
thread ever started:
faults](https://en.wikipedia.org/wiki/Page_fault) of each process since
it started:

* During the time range (T<sub>0</sub>, T<sub>1</sub>]:
* pid = `1001`, tid = `1`, #PF = `50`
* pid = `1001`, tid = `2`, #PF = `30`
* pid = `1001`, #PF = `50`
* pid = `1002`, #PF = `30`
* During the time range (T<sub>1</sub>, T<sub>2</sub>]:
* pid = `1001`, tid = `1`, #PF = `53`
* pid = `1001`, tid = `2`, #PF = `38`
* pid = `1001`, #PF = `53`
* pid = `1002`, #PF = `38`
* During the time range (T<sub>2</sub>, T<sub>3</sub>]
* pid = `1001`, tid = `1`, #PF = `56`
* pid = `1001`, tid = `2`, #PF = `42`
* pid = `1001`, #PF = `56`
* pid = `1002`, #PF = `42`
* During the time range (T<sub>3</sub>, T<sub>4</sub>]:
* pid = `1001`, tid = `1`, #PF = `60`
* pid = `1001`, tid = `2`, #PF = `47`
* pid = `1001`, #PF = `60`
* pid = `1002`, #PF = `47`
* During the time range (T<sub>4</sub>, T<sub>5</sub>]:
* thread 1 died, thread 3 started
* pid = `1001`, tid = `2`, #PF = `53`
* pid = `1001`, tid = `3`, #PF = `5`
* process 1001 died, process 1003 started
* pid = `1002`, #PF = `53`
* pid = `1003`, #PF = `5`
* During the time range (T<sub>5</sub>, T<sub>6</sub>]:
* A new process 1001 started
* pid = `1001`, #PF = `10`
* pid = `1002`, #PF = `57`
* pid = `1003`, #PF = `8`

Note that in the following examples, Cumulative aggregation
temporality is discussed before Delta aggregation temporality because
Expand All @@ -461,47 +466,56 @@ API with specified Cumulative aggregation temporality.
If we export the metrics using **Cumulative Temporality**:

* (T<sub>0</sub>, T<sub>1</sub>]
* attributes: {pid = `1001`, tid = `1`}, sum: `50`
* attributes: {pid = `1001`, tid = `2`}, sum: `30`
* attributes: {pid = `1001`}, sum: `50`
* attributes: {pid = `1002`}, sum: `30`
* (T<sub>0</sub>, T<sub>2</sub>]
* attributes: {pid = `1001`, tid = `1`}, sum: `53`
* attributes: {pid = `1001`, tid = `2`}, sum: `38`
* attributes: {pid = `1001`}, sum: `53`
* attributes: {pid = `1002`}, sum: `38`
* (T<sub>0</sub>, T<sub>3</sub>]
* attributes: {pid = `1001`, tid = `1`}, sum: `56`
* attributes: {pid = `1001`, tid = `2`}, sum: `42`
* attributes: {pid = `1001`}, sum: `56`
* attributes: {pid = `1002`}, sum: `42`
* (T<sub>0</sub>, T<sub>4</sub>]
* attributes: {pid = `1001`, tid = `1`}, sum: `60`
* attributes: {pid = `1001`, tid = `2`}, sum: `47`
* attributes: {pid = `1001`}, sum: `60`
* attributes: {pid = `1002`}, sum: `47`
* (T<sub>0</sub>, T<sub>5</sub>]
* attributes: {pid = `1001`, tid = `2`}, sum: `53`
* attributes: {pid = `1001`, tid = `3`}, sum: `5`
* attributes: {pid = `1002`}, sum: `53`
* (T<sub>4</sub>, T<sub>5</sub>]
* attributes: {pid = `1003`}, sum: `5`
* (T<sub>5</sub>, T<sub>6</sub>]
* attributes: {pid = `1001`}, sum: `10`
* (T<sub>0</sub>, T<sub>6</sub>]
* attributes: {pid = `1002`}, sum: `57`
* (T<sub>4</sub>, T<sub>6</sub>]
* attributes: {pid = `1003`}, sum: `8`

The behavior in the first four periods is quite straightforward - we
just take the data being reported from the asynchronous instruments
and send them.

The data model prescribes several valid behaviors at T<sub>5</sub> in
this case, where one stream dies and another starts. The [Resets and
Gaps](./data-model.md#resets-and-gaps) section describes how start
timestamps and staleness markers can be used to increase the
The data model prescribes several valid behaviors at T<sub>5</sub> and
T<sub>6</sub> in this case, where one stream dies and another starts.
The [Resets and Gaps](./data-model.md#resets-and-gaps) section describes
how start timestamps and staleness markers can be used to increase the
receiver's understanding of these events.

Consider whether the SDK maintains individual timestamps for the
individual stream, or just one per process. In this example, where a
thread can die and start counting page faults from zero, the valid
behaviors at T<sub>5</sub> are:
process can die and restart, it starts counting page faults from zero.
In this case, the valid behaviors at T<sub>5</sub> and T<sub>6</sub>
are:

1. If all streams in the process share a start time, and the SDK is
not required to remember all past streams: the thread restarts with
zero sum. Receivers with reset detection are able to calculate a
correct rate (except for frequent restarts relative to the
collection interval), however the precise time of a reset will be
unknown.
2. If the SDK maintains per-stream start times, it signals to the
receiver precisely when a stream started, making the first
observation in a stream more useful for diagnostics. Receivers can
perform overlap detection or duplicate suppression and do not
require reset detection, in this case.
zero sum, and the start time of the process. Receivers with reset
detection are able to calculate a correct rate (except for frequent
restarts relative to the collection interval), however the precise
time of a reset will be unknown.
2. If the SDK maintains per-stream start times, it provides the previous
callback time as the start time, as this time is before the occurrence
of any events which are measured during the subsequent callback. This
makes the first observation in a stream more useful for diagnostics,
as downstream consumers can perform overlap detection or duplicate
suppression and do not require reset detection in this case.
3. Independent of above treatments, the SDK can add a staleness marker
to indicate the start of a gap in the stream when one thread dies
by remembering which streams have previously reported but are not
Expand All @@ -519,20 +533,23 @@ data model.
If we export the metrics using **Delta Temporality**:

* (T<sub>0</sub>, T<sub>1</sub>]
* attributes: {pid = `1001`, tid = `1`}, delta: `50`
* attributes: {pid = `1001`, tid = `2`}, delta: `30`
* attributes: {pid = `1002`}, delta: `30`
* (T<sub>1</sub>, T<sub>2</sub>]
* attributes: {pid = `1001`, tid = `1`}, delta: `3`
* attributes: {pid = `1001`, tid = `2`}, delta: `8`
* attributes: {pid = `1001`}, delta: `3`
* attributes: {pid = `1002`}, delta: `8`
* (T<sub>2</sub>, T<sub>3</sub>]
* attributes: {pid = `1001`, tid = `1`}, delta: `3`
* attributes: {pid = `1001`, tid = `2`}, delta: `4`
* attributes: {pid = `1001`}, delta: `3`
* attributes: {pid = `1002`}, delta: `4`
* (T<sub>3</sub>, T<sub>4</sub>]
* attributes: {pid = `1001`, tid = `1`}, delta: `4`
* attributes: {pid = `1001`, tid = `2`}, delta: `5`
* attributes: {pid = `1001`}, delta: `4`
* attributes: {pid = `1002`}, delta: `5`
* (T<sub>4</sub>, T<sub>5</sub>]
* attributes: {pid = `1001`, tid = `2`}, delta: `6`
* attributes: {pid = `1001`, tid = `3`}, delta: `5`
* attributes: {pid = `1002`}, delta: `6`
* attributes: {pid = `1003`}, delta: `5`
* (T<sub>5</sub>, T<sub>6</sub>]
* attributes: {pid = `1001`}, delta: `10`
* attributes: {pid = `1002`}, delta: `4`
* attributes: {pid = `1003`}, delta: `3`

You can see that we are performing Cumulative->Delta conversion, and it requires
us to remember the last value of **every single permutation we've encountered so
Expand Down Expand Up @@ -560,27 +577,28 @@ So here are some suggestions that we encourage SDK implementers to consider:
##### Asynchronous example: attribute removal in a view

Suppose the metrics in the asynchronous example above are exported
through a view configured to remove the `tid` attribute, leaving a
single-dimensional count of page faults by `pid`. For each metric
stream, two measurements are produced covering the same interval of
time, which the SDK is expected to aggregate before producing the
output.
through a view configured to remove the `pid` attribute, leaving a
count of page faults. For each metric stream, two measurements are produced
covering the same interval of time, which the SDK is expected to aggregate
before producing the output.

The data model specifies to use the "natural merge" function, in this
case meaning to add the current point values together because they
are `Sum` data points. The expected output is, still in **Cumulative
Temporality**:

* (T<sub>0</sub>, T<sub>1</sub>]
* dimensions: {pid = `1001`}, sum: `80`
* dimensions: {}, sum: `80`
* (T<sub>0</sub>, T<sub>2</sub>]
* dimensions: {pid = `1001`}, sum: `91`
* dimensions: {}, sum: `91`
* (T<sub>0</sub>, T<sub>3</sub>]
* dimensions: {pid = `1001`}, sum: `98`
* dimensions: {}, sum: `98`
* (T<sub>0</sub>, T<sub>4</sub>]
* dimensions: {pid = `1001`}, sum: `107`
* dimensions: {}, sum: `107`
* (T<sub>0</sub>, T<sub>5</sub>]
* dimensions: {pid = `1001`}, sum: `58`
* dimensions: {}, sum: `58`
* (T<sub>0</sub>, T<sub>6</sub>]
* dimensions: {}, sum: `75`

As discussed in the asynchronous cumulative temporality example above,
there are various treatments available for detecting resets. Even if
Expand Down

0 comments on commit db9422d

Please sign in to comment.