System metrics semantic conventions #937

Merged
merged 32 commits into from
Oct 15, 2020
Changes from 7 commits
1040fc2
System metrics semantic conventions
aabmass Sep 9, 2020
f7f2ef7
change process count to UpDownSumObserver
aabmass Sep 11, 2020
98d72a1
fix system.cpu.utilization, use better example
aabmass Sep 11, 2020
9d20079
first several comments
aabmass Sep 24, 2020
fd6375e
add description columns, update units to UCUM
aabmass Sep 24, 2020
9d871af
Merge branch 'master' into system-metrics-818
aabmass Sep 24, 2020
a0e3e2d
markdown-toc
aabmass Sep 24, 2020
7d02a69
Merge branch 'master' into system-metrics-818
aabmass Sep 28, 2020
4f7d3e1
clarify OS process level metrics
aabmass Sep 28, 2020
dc13aa2
clarify load average exapmle
aabmass Sep 28, 2020
5e7cde9
Merge branch 'master' into system-metrics-818
aabmass Oct 1, 2020
ceb99bb
move general conventions + OTEP 108 into README.md
aabmass Oct 1, 2020
45ae1f8
renamed swap -> paging
aabmass Oct 1, 2020
b3f7508
add addition fs labels
aabmass Oct 1, 2020
2512dd7
fix links
aabmass Oct 1, 2020
964c535
fix link
aabmass Oct 1, 2020
cde2393
Update specification/metrics/semantic_conventions/README.md
aabmass Oct 6, 2020
b758d24
Update specification/metrics/semantic_conventions/README.md
aabmass Oct 6, 2020
c9a37fb
Apply suggestions from code review
aabmass Oct 8, 2020
6c1c579
fix tigran comments
aabmass Oct 6, 2020
5ffcb58
add disk io_time and operation_time
aabmass Oct 8, 2020
1b90514
add descriptions/footnotes for dropped packets and net errors
aabmass Oct 8, 2020
5ffd8d0
Merge branch 'master' into system-metrics-818
aabmass Oct 8, 2020
7b14a93
lint, more info for net dropped packets/errors
aabmass Oct 8, 2020
a903783
"dropped_packets" -> "dropped"
aabmass Oct 9, 2020
c218cac
Apply suggestions from James' code review
aabmass Oct 12, 2020
09a31b7
comments from James' code review
aabmass Oct 12, 2020
fdea5e4
Merge branch 'master' into system-metrics-818
aabmass Oct 12, 2020
8fec8f9
clarify windows perf counter
aabmass Oct 12, 2020
aa5e16e
Update specification/metrics/semantic_conventions/README.md
aabmass Oct 15, 2020
aa28566
reflow text
aabmass Oct 15, 2020
7f808ab
Merge branch 'master' into system-metrics-818
aabmass Oct 15, 2020
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -31,6 +31,8 @@ New:
[#946](https://github.com/open-telemetry/opentelemetry-specification/pull/946))
- Update the header name for otel baggage, and version date
([#981](https://github.com/open-telemetry/opentelemetry-specification/pull/981))
- Add semantic conventions for system metrics
([#937](https://github.com/open-telemetry/opentelemetry-specification/pull/937))

Updates:

7 changes: 6 additions & 1 deletion specification/metrics/semantic_conventions/README.md
@@ -1,6 +1,11 @@
# Metrics Semantic Conventions

TODO: Add semantic conventions for metric names and labels.
The following semantic conventions surrounding metrics are defined:

* [HTTP Metrics](http-metrics.md): Semantic conventions and instruments for HTTP metrics.
* [System Metrics](system-metrics.md): Semantic conventions and instruments for standard system metrics.
* [Process Metrics](process-metrics.md): Semantic conventions and instruments for standard process metrics.
* [Runtime Environment Metrics](runtime-environment-metrics.md): Semantic conventions and instruments for runtime environment metrics.

Apart from semantic conventions for metrics and [traces](../../trace/semantic_conventions/README.md),
OpenTelemetry also defines the concept of overarching [Resources](../../resource/sdk.md) with their own
21 changes: 21 additions & 0 deletions specification/metrics/semantic_conventions/process-metrics.md
@@ -0,0 +1,21 @@
# Semantic Conventions for Process Metrics

This document describes instruments and labels for common process level
metrics in OpenTelemetry. Also consider the general [semantic conventions for
system metrics](system-metrics.md#semantic-conventions) when creating
instruments not explicitly defined in this document.

<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->

<!-- toc -->

- [Metric Instruments](#metric-instruments)
* [Standard Process Metrics - `process.`](#standard-process-metrics---process)

<!-- tocstop -->

## Metric Instruments

### Standard Process Metrics - `process.`

TODO
46 changes: 46 additions & 0 deletions specification/metrics/semantic_conventions/runtime-environment-metrics.md
@@ -0,0 +1,46 @@
# Semantic Conventions for Runtime Environment Metrics

This document includes semantic conventions for runtime environment level
metrics in OpenTelemetry. Also consider the general semantic conventions for
[system metrics](system-metrics.md#semantic-conventions) and [process
metrics](process-metrics.md) when instrumenting runtime environments.

<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->

<!-- toc -->

- [Metric Instruments](#metric-instruments)
* [Runtime Environment Metrics - `runtime.`](#runtime-environment-metrics---runtime)
+ [Runtime Environment Specific Metrics - `runtime.{environment}.`](#runtime-environment-specific-metrics---runtimeenvironment)

<!-- tocstop -->

## Metric Instruments

### Runtime Environment Metrics - `runtime.`

Runtime environments vary widely in their terminology, implementation, and
relative values for a given metric. For example, Go and Python are both
garbage collected languages, but comparing heap usage between the Go and
CPython runtimes directly is not meaningful. For this reason, this document
does not propose any standard top-level runtime metric instruments. See [OTEP
108](https://github.com/open-telemetry/oteps/pull/108/files) for additional
discussion.

#### Runtime Environment Specific Metrics - `runtime.{environment}.`

Metrics specific to a certain runtime environment should be prefixed with
`runtime.{environment}.` and follow the semantic conventions outlined in
[semantic conventions for system
metrics](system-metrics.md#semantic-conventions). Authors of runtime
instrumentations are responsible for the choice of `{environment}` to avoid
ambiguity when interpreting a metric's name or values.

For example, some programming languages have multiple runtime environments
that vary significantly in their implementation, like [Python which has many
implementations](https://wiki.python.org/moin/PythonImplementations). For
such languages, consider using specific `{environment}` prefixes to avoid
ambiguity, like `runtime.cpython.` and `runtime.pypy.`.

There are other dimensions even within a given runtime environment to
consider, for example pthreads vs green thread implementations.
178 changes: 178 additions & 0 deletions specification/metrics/semantic_conventions/system-metrics.md
@@ -0,0 +1,178 @@
# Semantic Conventions for System Metrics

This document describes instruments and labels for common system level
metrics in OpenTelemetry. Also included are general semantic conventions for
system, process, and runtime metrics, which should be considered when
creating instruments not explicitly defined in the specification.

<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->

<!-- toc -->

- [Semantic Conventions](#semantic-conventions)
* [Instrument Naming](#instrument-naming)
* [Units](#units)
- [Metric Instruments](#metric-instruments)
* [Standard System Metrics - `system.`](#standard-system-metrics---system)
+ [`system.cpu.` - Processor metrics](#systemcpu---processor-metrics)
+ [`system.memory.` - Memory metrics](#systemmemory---memory-metrics)
+ [`system.swap.` - Swap/paging metrics](#systemswap---swappaging-metrics)
+ [`system.disk.` - Disk controller metrics](#systemdisk---disk-controller-metrics)
+ [`system.filesystem.` - Filesystem metrics](#systemfilesystem---filesystem-metrics)
+ [`system.network.` - Network metrics](#systemnetwork---network-metrics)
+ [`system.process.` - Aggregate system process metrics](#systemprocess---aggregate-system-process-metrics)
+ [`system.{os}.` - OS Specific System Metrics](#systemos---os-specific-system-metrics)

<!-- tocstop -->

## Semantic Conventions

The following semantic conventions aim to keep naming consistent. They
provide guidelines for most of the cases in this specification and should be
followed for other instruments not explicitly defined in this document.

### Instrument Naming

- **limit** - an instrument that measures the constant, known total amount of
something should be called `entity.limit`. For example, `system.memory.limit`
for the total amount of memory on a system.

- **usage** - an instrument that measures an amount used out of a known total
(**limit**) amount should be called `entity.usage`. For example,
`system.memory.usage` with label `state = used | cached | free | ...` for the
amount of memory in each state. In many cases, the sum of **usage** over
all label values is equal to the **limit**.

A measure of the amount consumed of an unlimited resource is differentiated
from **usage**.

- **utilization** - an instrument that measures the *fraction* of **usage**
out of its **limit** should be called `entity.utilization`. For example,
`system.memory.utilization` for the fraction of memory in use. Utilization
values are in the range `[0, 1]`.

- **time** - an instrument that measures passage of time should be called
`entity.time`. For example, `system.cpu.time` with label `state = idle | user
| system | ...`. **time** measurements are not necessarily wall time and can be less than
or greater than the real wall time between measurements.

**time** instruments are a special case of **usage** metrics, where the
**limit** can usually be calculated as the sum of **time** over all label
values. **utilization** can also be calculated and is useful, for example
`system.cpu.utilization`.

- **io** - an instrument that measures bidirectional data flow should be
called `entity.io` and have labels for direction. For example,
`system.network.io`.

- Other instruments that do not fit the above descriptions may be named more
freely. For example, `system.swap.page_faults` and `system.network.packets`.
Units do not need to be specified in the names since they are included during
instrument creation, but can be added if there is ambiguity.
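
The relationship between **limit**, **usage**, and **utilization** described
above can be sketched in plain Python. This is an illustration only and not
part of the conventions: the byte counts and the helper names
`limit_from_usage` and `utilization` are hypothetical, and the sketch assumes
the common case where the sum of **usage** over all `state` values equals the
**limit**.

```python
# Hypothetical byte counts for each `state` label value of system.memory.usage.
memory_usage = {
    "used": 6 * 1024**3,
    "cached": 1 * 1024**3,
    "free": 1 * 1024**3,
}


def limit_from_usage(usage_by_state):
    """Sum usage over all label values (assumed here to equal the limit)."""
    return sum(usage_by_state.values())


def utilization(usage_by_state, limit):
    """Return the fraction of the limit in each state, each within [0, 1]."""
    return {state: value / limit for state, value in usage_by_state.items()}


memory_limit = limit_from_usage(memory_usage)
memory_utilization = utilization(memory_usage, memory_limit)
print(memory_limit)
print(memory_utilization)
```

With these numbers, `memory_limit` plays the role of `system.memory.limit` and
each entry of `memory_utilization` corresponds to one `state` value of
`system.memory.utilization`.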

### Units

Units should follow [UCUM](http://unitsofmeasure.org/ucum.html) (further
clarification is needed, see
[#705](https://github.com/open-telemetry/opentelemetry-specification/issues/705)).

- Instruments for **utilization** metrics (that measure the fraction out of a total)
SHOULD use units of `1`.
- Instruments that measure an integer count of something have
["non-units"](https://ucum.org/ucum.html#section-Examples-for-some-Non-Units.)
and SHOULD use [annotations](https://ucum.org/ucum.html#para-curly) with curly
braces. For example `{packets}`, `{errors}`, `{faults}`, etc.
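
The two unit guidelines above could be checked mechanically along the
following lines. This is a minimal sketch under stated assumptions: the helper
names `is_annotation` and `unit_follows_convention` and the `counts_items`
flag are hypothetical, and the regular expression is a simplification that
does not implement the full UCUM grammar.

```python
import re

# Simplified pattern for a UCUM annotation: curly braces around a non-empty
# run of characters that contains no braces or whitespace (an assumption
# made for this sketch, not a full UCUM grammar).
_ANNOTATION = re.compile(r"\{[^{}\s]+\}")


def is_annotation(unit: str) -> bool:
    """Return True if `unit` looks like an annotation such as {packets}."""
    return _ANNOTATION.fullmatch(unit) is not None


def unit_follows_convention(name: str, unit: str, counts_items: bool = False) -> bool:
    """Check `unit` against the two unit guidelines listed above."""
    if name.endswith(".utilization"):
        # Fractions out of a total use the unit `1`.
        return unit == "1"
    if counts_items:
        # Integer counts of something use a curly-brace annotation.
        return is_annotation(unit)
    # Other units (for example `By` or `s`) are not checked in this sketch.
    return True


print(unit_follows_convention("system.memory.utilization", "1"))
print(unit_follows_convention("system.network.packets", "{packets}", counts_items=True))
print(unit_follows_convention("system.network.packets", "packets", counts_items=True))
```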

## Metric Instruments

### Standard System Metrics - `system.`

#### `system.cpu.` - Processor metrics

**Description:** System level processor metrics.

| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
| ---------------------- | ----------- | ----- | --------------- | ---------- | --------- | ----------------------------------- |
| system.cpu.time | | s | SumObserver | Double | state | idle, user, system, interrupt, etc. |
| | | | | | cpu | CPU number (0..n) |
| system.cpu.utilization | | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. |
| | | | | | cpu | CPU number (0..n) |
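
As noted under Instrument Naming, **utilization** can be calculated from
**time** measurements. One possible way to derive `system.cpu.utilization`
for a single CPU from two cumulative `system.cpu.time` observations is
sketched below. The function name `cpu_utilization` and the sample values are
hypothetical, and the sketch assumes that the listed states cover all of the
time accounted to that CPU.

```python
def cpu_utilization(previous, current):
    """Derive per-state utilization from two cumulative CPU time samples.

    Both arguments map a `state` label value to cumulative seconds for one
    CPU. The result maps each state to its fraction of the elapsed CPU time.
    """
    deltas = {state: current[state] - previous[state] for state in current}
    total = sum(deltas.values())
    if total <= 0:
        # No time elapsed between the samples, so nothing can be attributed.
        return {state: 0.0 for state in deltas}
    return {state: delta / total for state, delta in deltas.items()}


# Hypothetical cumulative seconds for cpu 0 at two observation times.
first = {"idle": 100.0, "user": 40.0, "system": 10.0}
second = {"idle": 103.0, "user": 44.0, "system": 11.0}

result = cpu_utilization(first, second)
print(result)
```

Here the sum of the deltas over all states acts as the **limit** for the
interval, which is why the resulting fractions add up to 1.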

#### `system.memory.` - Memory metrics

**Description:** System level memory metrics. This does not include [paging/swap
memory](#systemswap---swappaging-metrics).

| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
| ------------------------- | ----------- | ----- | ----------------- | ---------- | --------- | ------------------------ |
| system.memory.usage | | By | UpDownSumObserver | Int64 | state | used, free, cached, etc. |
| system.memory.utilization | | 1 | ValueObserver | Double | state | used, free, cached, etc. |

#### `system.swap.` - Swap/paging metrics

**Description:** System level paging/swap memory metrics.
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
| ---------------------------- | ----------------------------------- | ------------ | ----------------- | ---------- | --------- | ------------ |
| system.swap.usage | Unix swap or windows pagefile usage | By | UpDownSumObserver | Int64 | state | used, free |
| system.swap.utilization | | 1 | ValueObserver | Double | state | used, free |
| system.swap.page\_faults | | {faults} | SumObserver | Int64 | type | major, minor |
| system.swap.page\_operations | | {operations} | SumObserver | Int64 | type | major, minor |
| | | | | | direction | in, out |

#### `system.disk.` - Disk controller metrics

**Description:** System level disk performance metrics.
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
| ---------------------------- | ----------- | ------------ | --------------- | ---------- | --------- | ------------ |
| system.disk.io<!--notlink--> | | By | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | read, write |
| system.disk.operations | | {operations} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | read, write |
| system.disk.time | | s | SumObserver | Double | device | (identifier) |
| | | | | | direction | read, write |
| system.disk.merged | | {operations} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | read, write |

#### `system.filesystem.` - Filesystem metrics

**Description:** System level filesystem metrics.
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
| ----------------------------- | ----------- | ----- | ----------------- | ---------- | --------- | -------------------- |
| system.filesystem.usage | | By | UpDownSumObserver | Int64 | device | (identifier) |
| | | | | | state | used, free, reserved |
| system.filesystem.utilization | | 1 | ValueObserver | Double | device | (identifier) |
| | | | | | state | used, free, reserved |

#### `system.network.` - Network metrics

**Description:** System level network metrics.
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
| ------------------------------- | ----------- | ------------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- |
| system.network.dropped\_packets | | {packets} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | transmit, receive |
| system.network.packets | | {packets} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | transmit, receive |
| system.network.errors | | {errors} | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | transmit, receive |
| system<!--notlink-->.network.io | | By | SumObserver | Int64 | device | (identifier) |
| | | | | | direction | transmit, receive |
| system.network.connections | | {connections} | UpDownSumObserver | Int64 | device | (identifier) |
| | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) |
| | | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) |

#### `system.process.` - Aggregate system process metrics

**Description:** System level aggregate process metrics. For metrics at the
individual process level, see [process metrics](process-metrics.md).
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values |
| -------------------- | --------------------------------------- | ----------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- |
| system.process.count | Total number of processes in each state | {processes} | UpDownSumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) |

#### `system.{os}.` - OS Specific System Metrics

Instrument names for system level metrics that have different and conflicting
meanings across multiple OSes should be prefixed with `system.{os}.` and
follow the hierarchies listed above for different entities like CPU, memory,
and network. For example, an instrument for measuring the load average on
Linux could be named `system.linux.cpu.load`, reusing the `cpu` name proposed
above.
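
The naming rule can be illustrated with a small helper. This is a sketch only:
the function `system_instrument_name` and its parameters are hypothetical and
are not defined by this specification.

```python
from typing import Optional


def system_instrument_name(entity_path: str, os_name: Optional[str] = None) -> str:
    """Build a `system.` instrument name, adding the `{os}` segment if given."""
    if os_name is None:
        return f"system.{entity_path}"
    return f"system.{os_name}.{entity_path}"


# A metric with a consistent meaning across OSes keeps the plain prefix.
print(system_instrument_name("cpu.time"))
# A metric whose meaning differs between OSes gets the OS-specific prefix.
print(system_instrument_name("cpu.load", os_name="linux"))
```

The second call yields `system.linux.cpu.load`, matching the load average
example above, while the `cpu` segment is reused in both names.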