EDU-3257: Updates Task-Queue metrics coverage to reduce confusion (#3156)

* EDU-3257: Updates Task-Queue metrics coverage to reduce confusion

- "Describe" returns info that isn't metrics
- METRIC ACCURACY warnings are unnecessary clutter and scary

* Update docs/develop/worker-performance.mdx

Co-authored-by: Brian P. Hogan <[email protected]>

---------

Co-authored-by: Brian P. Hogan <[email protected]>
fairlydurable and napcs authored Oct 17, 2024
1 parent 33d4585 commit d331684
Showing 1 changed file with 61 additions and 72 deletions.
133 changes: 61 additions & 72 deletions docs/develop/worker-performance.mdx
@@ -37,99 +37,50 @@ The Workflow Cache is created and shared between all the workers. It's designed

These options limit the resource consumption of the in-memory Workflow cache. Workflow cache options are shared between all Workers because the cache affects resources of the whole host, such as memory and the total number of threads, and should be limited per JVM.

## Monitor Task Queue backlog metrics {#task-queue-metrics}
## Available Task Queue information {#task-queue-metrics}

A [Task Queue](https://docs.temporal.io/workers#task-queue) is a lightweight, dynamically allocated queue.
[Worker Entities](https://docs.temporal.io/workers#worker-entity) poll the queue for [Tasks](https://docs.temporal.io/workers#task).
The Temporal Service dynamically creates different [Task Queue types](/workers#task-queue) including Activity Task Queues, Workflow Task Queues, and Nexus Task Queues.
These Task Queue types route their Tasks to Workers for Task completion.

With an accurate estimate of backlog Tasks, you can determine the optimal number of Workers to deploy.
Balance your Worker count with the number of Tasks to achieve the best performance.
This approach minimizes Task backlog saturation and reduces idle Workers.

Task Queue metrics provide numerical insights into your Task Queue activity and backlog characteristics.
Use these metrics to tune your production deployments.
Evaluate your Worker loads and assess whether you need to scale up or reduce your Worker deployment.

### Task Queue types {#task-queue-types}

For each Task Queue name, Temporal creates separate queues for each Task Queue type, namely:
:::tip Support, stability, and dependency info

- **Activity Task Queue**: A queue that holds Activity Tasks.
Activity Tasks represent units of work within a larger business process.
Each Activity Task contains the context required by an Activity Definition.
Workers poll Activity Tasks from the Activity Task Queue and use them to initiate Activity Executions.
- **Workflow Task Queue**: A queue that holds Workflow Tasks.
Workflow Tasks contain the context needed by a Workflow Definition.
Workers poll Workflow Tasks from the Workflow Task Queue and use them to initiate Workflow Executions.
- **Nexus Task Queue** ([Public Preview](/evaluate/development-production-features/release-stages#public-preview)): A queue that holds Nexus Tasks.
Nexus Tasks are used for units of work passed between Namespaces.
Workers configured with the Nexus Service poll the Nexus Task Queue for available tasks and handle them by initiating Workflow Executions in the target Namespace.
Each Nexus Task contains the context required by the target Workflow Definition.
The information listed in this section is readable using the `DescribeTaskQueueEnhanced` method in the [Go SDK](https://github.com/temporalio/sdk-go/blob/74320648ab0e4178b1fedde01672f9b5b9f6c898/client/client.go), with the [Temporal CLI](https://github.com/temporalio/cli/releases/tag/v1.1.0) `task-queue describe` command, and using `DescribeTaskQueue` through [RPC](https://github.com/temporalio/temporal/blob/c4b6b34f33f4772564c8ae759f2e907e3651a65a/service/matching/matching_engine.go).

Each Task Queue type provides unaggregated metrics.
:::

### Task Queue metrics {#task-queue-metrics-list}

The Temporal Service reports information separately for each Task Queue type (not aggregated).
Use the following metrics to retrieve detailed information about Task Queue health and performance.
Available metrics include:
Use the following Task Queue properties to retrieve and evaluate information about Task Queue health and performance.
Available data include:

- [`ApproximateBacklogCount`](#ApproximateBacklogCount)
- [`ApproximateBacklogAge`](#ApproximateBacklogAge)
- [`ApproximateBacklogCount`](#ApproximateBacklogCountAndAge) and [`ApproximateBacklogAge`](#ApproximateBacklogCountAndAge)
- [`TasksAddRate`](#TasksAddRate-and-TasksDispatchRate) and [`TasksDispatchRate`](#TasksAddRate-and-TasksDispatchRate)
- [`BacklogIncreaseRate`](#BacklogIncreaseRate) (derived from [`TasksAddRate`](#TasksAddRate-and-TasksDispatchRate) and [`TasksDispatchRate`](#TasksAddRate-and-TasksDispatchRate))

#### `ApproximateBacklogCount` {#ApproximateBacklogCount}
### `ApproximateBacklogCount` and `ApproximateBacklogAge` {#ApproximateBacklogCountAndAge}

Represents the approximate count of Tasks currently backlogged in this Task Queue.
`ApproximateBacklogCount` represents the approximate count of Tasks currently backlogged in this Task Queue.
The number may include expired Tasks as well as active Tasks, but it will eventually converge to the correct count over time.

You can rely on this count when making scaling decisions.

:::info Metric Accuracy

Workflow Task Queue types provide partial information due to performance optimizations.
Tasks sent to [Sticky](https://docs.temporal.io/workers#sticky-execution) queues are not included in the returned values for this metric.
Since Tasks remain valid for only a few seconds in Sticky Queues, this inaccuracy diminishes over time, especially as the backlog grows.
:::

#### `ApproximateBacklogAge` {#ApproximateBacklogAge}

Returns the approximate age of the oldest Task in the backlog.
`ApproximateBacklogAge` returns the approximate age of the oldest Task in the backlog.
The age is based on the creation time of the Task at the head of the queue.

You can rely on this count when making scaling decisions.

:::info Metric Accuracy
You can rely on both of these values when making scaling decisions.

Workflow Task Queue types provide partial information due to performance optimizations.
Tasks sent to [Sticky](https://docs.temporal.io/workers#sticky-execution) queues are not included in the returned values.
Since Tasks remain valid for only a few seconds in Sticky Queues, this inaccuracy diminishes over time, especially when the backlog is older than a few seconds.
Please note: [Sticky queues](https://docs.temporal.io/workers#sticky-execution) will affect these values, but only for a few seconds.
That's because Tasks sent to Sticky queues are not included in the returned values for `ApproximateBacklogCount` and `ApproximateBacklogAge`.
Inaccuracies diminish as the backlog grows.

:::

#### `TasksAddRate` and `TasksDispatchRate` {#TasksAddRate-and-TasksDispatchRate}
### `TasksAddRate` and `TasksDispatchRate` {#TasksAddRate-and-TasksDispatchRate}

Reports the approximate Tasks-per-second added to or dispatched from a Task Queue.
This rate is averaged over the most recent 30-second time interval.
The calculations include Tasks that were added to or dispatched from the backlog as well as Tasks that were immediately dispatched and bypassed the backlog (sync-matched).

:::info Metric Accuracy

The actual Task delivery count may be significantly higher than the number reported by these metrics:
The actual Task delivery count may be significantly higher than the number reported by these two values:

- Eager dispatch refers to a Temporal feature where Activities can be requested by an SDK using one Workflow Task completion response.
Tasks using Eager dispatch do not pass through Task Queues.
- A Sticky Task Queue is associated with a dedicated Worker instance.
Tasks passed to Sticky Task Queues are not accounted for by these metrics.
Normally, only the first Workflow Task of each Workflow is placed on a Workflow Task Queue.
Subsequent Tasks are passed to the Sticky Task Queue for performance improvement.
- Tasks passed to Sticky Task Queues are not included in the returned values for `TasksAddRate` and `TasksDispatchRate`.

:::

#### `BacklogIncreaseRate` {#BacklogIncreaseRate}
### `BacklogIncreaseRate` {#BacklogIncreaseRate}

Approximates the _net_ Tasks per second added to the backlog, averaged over the most recent 30 seconds.
This is calculated as:
@@ -141,13 +92,31 @@ TasksAddRate - TasksDispatchRate
- Positive values of `X` indicate the backlog is growing by about `X` Tasks per second.
- Negative values of `X` indicate the backlog is shrinking by about `X` Tasks per second.

:::info Metric Accuracy

While individual `add` and `dispatch` rates may be inaccurate due to Eager and Sticky Task Queues, the `BacklogIncreaseRate` reliably reflects the rate at which the backlog is shrinking or growing for backlogs older than a few seconds.

## Evaluate Task Queue performance {#evaluate-worker-loads}

A [Task Queue](https://docs.temporal.io/workers#task-queue) is a lightweight, dynamically allocated queue.
[Worker Entities](https://docs.temporal.io/workers#worker-entity) poll the queue for [Tasks](https://docs.temporal.io/workers#task) and retrieve Tasks to work on.
Tasks are contexts that a Worker progresses using a specific Workflow Execution, Activity Execution, or a Nexus Task Execution.
Each Task Queue type offers its Tasks to compatible Workers for Task completion.
The Temporal Service dynamically creates different [Task Queue types](/workers#task-queue) including Activity Task Queues, Workflow Task Queues, and Nexus Task Queues.

With an accurate estimate of backlog Tasks, you can determine the optimal number of Workers to deploy.
Balance your Worker count with the number of Tasks to achieve the best performance.
This approach minimizes Task backlog saturation and reduces idle Workers.

Task Queue data provide numerical insights into your Task Queue activity and backlog characteristics.
Use these numbers to tune your production deployments.
Evaluate your Worker loads and assess whether you need to scale up or reduce your Worker deployment.

:::note RATE LIMITS

[Visibility API rate limits](/cloud/limits#visibility-api-rate-limit) apply to Task Queue performance data requests.

:::

### Fetch Worker metrics {#retrieve-worker-info}
### Query Task Queue info with Temporal CLI {#cli-task-queue-info}

The Temporal CLI helps you monitor and evaluate Worker performance.
Issue the following command to display a list of active Workers that have recently polled a Task Queue:
@@ -168,7 +137,27 @@ This feature may significantly change or be removed in a future release.

:::

#### Evaluate Worker availability and capacity issues {#worker-capacity-issues}
### Query Task Queue info with the Go SDK {#go-sdk-task-queue-info}

Retrieve Task Queue data using the Go SDK by calling `DescribeTaskQueueEnhanced`.
Specify the Task Queue name and set `ReportStats` to `true`, as in the following example:

```go
for _, taskQueueName := range taskQueueNames {
	resp, err := s.client.DescribeTaskQueueEnhanced(ctx, client.DescribeTaskQueueEnhancedOptions{
		TaskQueue:   taskQueueName,
		ReportStats: true,
	})
	if err != nil {
		log.Printf("Error describing task queue %s: %v", taskQueueName, err)
		// Skip this Task Queue: the response is not usable when the call fails.
		continue
	}

	// Get the backlog count from the enhanced response
	backlogCount += getBacklogCount(resp)
}
```

### Evaluate Worker availability and capacity issues {#worker-capacity-issues}

Each Temporal [Server](https://docs.temporal.io/clusters#temporal-server) records the last time of each poll request.
This time is displayed in the `temporal task-queue describe` output.
@@ -181,7 +170,7 @@ This time is displayed in the `temporal task-queue describe` output.
- Values over 5 minutes since the last poll request usually suggest that Workers have shut down or been removed.
Workers are removed if 5 minutes have passed since the last poll request.

#### Manage your Worker fleet {#manage-your-worker-fleet}
### Manage your Worker fleet {#manage-your-worker-fleet}

You can adjust the number of Workers to enhance Workflow Execution performance and manage your fleet size.
For instance, a large backlog of Tasks with too few Workers will slow down Workflow Execution completions and decrease processing efficiency.