Jsonnet / Helm: relax the hash ring heartbeat period and timeout for distributor, ingester, store-gateway and compactor (#6860)

* Relaxed the hash ring heartbeat period and timeout for distributor, ingester, store-gateway and compactor.

These values help reduce the pressure on the KV store or, when memberlist is used, the CPU it spends passing messages.
The tradeoff is that peers will take longer to detect abrupt shutdowns or crashes of components.
We've been running with these values at Grafana Labs for some time and haven't seen problems.

Signed-off-by: Dimitar Dimitrov <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

* Fixed store-gateway config in Helm

Signed-off-by: Marco Pracucci <[email protected]>

* Fix distributor ring config in Helm

Signed-off-by: Marco Pracucci <[email protected]>

* Ignore non-impactful differences between Jsonnet and Helm

Signed-off-by: Marco Pracucci <[email protected]>

* Fix ignoring non-impactful differences between Jsonnet and Helm

Signed-off-by: Marco Pracucci <[email protected]>

* Fix ignoring non-impactful differences between Jsonnet and Helm

Signed-off-by: Marco Pracucci <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Marco Pracucci <[email protected]>
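
To make the tradeoff described in the commit message concrete, here is a rough sketch of the detection window implied by the new values. It assumes that a peer treats an instance as unhealthy once its last heartbeat is older than `heartbeat_timeout`, and it ignores KV store propagation delay and clock skew. Under that assumption a crashed instance is noticed somewhere between `heartbeat_timeout - heartbeat_period` and `heartbeat_timeout` after the crash. The windows in the comments are derived from that assumption, not measured.

```yaml
# Values taken from this commit; the "detected after" comments are an
# estimate under the assumption stated above, not a guarantee.
distributor:
  ring:
    heartbeat_period: 1m    # last heartbeat is at most 1m old when the crash happens
    heartbeat_timeout: 4m   # crash detected after roughly 3m to 4m
ingester:
  ring:
    heartbeat_period: 2m
    heartbeat_timeout: 10m  # crash detected after roughly 8m to 10m
store_gateway:
  sharding_ring:
    heartbeat_period: 1m
    heartbeat_timeout: 4m   # crash detected after roughly 3m to 4m
compactor:
  sharding_ring:
    heartbeat_period: 1m
    heartbeat_timeout: 4m   # crash detected after roughly 3m to 4m
```

Graceful shutdowns are a separate case and are not covered by this estimate.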
dimitarvdimitrov and pracucci authored Jan 18, 2024
1 parent cea56dc commit d1bc422
Showing 80 changed files with 808 additions and 5 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -67,6 +67,14 @@
* [CHANGE] rollout-operator: remove default CPU limit. #7066
* [CHANGE] Store-gateway: Increase `JAEGER_REPORTER_MAX_QUEUE_SIZE` from the default (100) to 1000, to avoid dropping tracing spans. #7068
* [CHANGE] Query-frontend, ingester, ruler, backend and write instances: Increase `JAEGER_REPORTER_MAX_QUEUE_SIZE` from the default (100), to avoid dropping tracing spans. #7086
* [CHANGE] Ring: relaxed the hash ring heartbeat period and timeout for distributor, ingester, store-gateway and compactor: #6860
  * `-distributor.ring.heartbeat-period` set to `1m`
  * `-distributor.ring.heartbeat-timeout` set to `4m`
  * `-ingester.ring.heartbeat-period` set to `2m`
  * `-store-gateway.sharding-ring.heartbeat-period` set to `1m`
  * `-store-gateway.sharding-ring.heartbeat-timeout` set to `4m`
  * `-compactor.ring.heartbeat-period` set to `1m`
  * `-compactor.ring.heartbeat-timeout` set to `4m`
* [FEATURE] Added support for the following root-level settings to configure the list of matchers to apply to node affinity: #6782 #6829
  * `alertmanager_node_affinity_matchers`
  * `compactor_node_affinity_matchers`
@@ -86,6 +86,14 @@ patches:
- op: remove
path: /config/frontend_worker
- target:
kind: MimirConfig
name: 'alertmanager|compactor|distributor|overrides-exporter|querier|query-frontend|query-scheduler|ruler|store-gateway'
patch: |-
# Jsonnet configures the ingester ring heartbeat period only on the ingester (other components don't need it).
- op: remove
path: /config/ingester/ring/heartbeat_period
- target:
kind: MimirConfig
name: 'alertmanager|compactor|store-gateway|query-frontend|query-scheduler|overrides-exporter'
@@ -177,10 +185,36 @@ patches:
- op: remove
path: /config/alertmanager/fallback_config_file
- target:
kind: MimirConfig
name: 'alertmanager|compactor|ingester|overrides-exporter|querier|query-frontend|query-scheduler|ruler|store-gateway'
patch: |-
# Jsonnet configures the distributor ring only on the distributor (other components don't need it).
- op: remove
path: /config/distributor/ring/heartbeat_period
- op: remove
path: /config/distributor/ring/heartbeat_timeout
- target:
kind: MimirConfig
name: 'alertmanager|compactor|distributor|ingester|overrides-exporter|querier|query-frontend|query-scheduler|ruler'
patch: |-
# Jsonnet doesn't set this on non-store-gateway components
- op: remove
path: /config/store_gateway/sharding_ring/unregister_on_shutdown
- target:
kind: MimirConfig
name: 'alertmanager|compactor|distributor|ingester|overrides-exporter|querier|query-frontend|query-scheduler|ruler'
patch: |-
# Jsonnet configures the store-gateway ring heartbeat period only on the store-gateway (other components don't need it).
- op: remove
path: /config/store_gateway/sharding_ring/heartbeat_period
- target:
kind: MimirConfig
name: 'alertmanager|compactor|distributor|ingester|overrides-exporter|query-frontend|query-scheduler'
patch: |-
# Jsonnet configures the store-gateway ring heartbeat timeout only on components using the store-gateway ring (store-gateway, querier, ruler).
- op: remove
path: /config/store_gateway/sharding_ring/heartbeat_timeout
@@ -7,7 +7,6 @@ config:
ring:
instance_availability_zone:
num_tokens:
heartbeat_timeout:
unregister_on_shutdown:
distributor:
ha_tracker:
9 changes: 9 additions & 0 deletions operations/helm/charts/mimir-distributed/CHANGELOG.md
@@ -29,6 +29,15 @@ Entries should include a reference to the Pull Request that introduced the chang
## main / unreleased

* [CHANGE] Rollout-operator: remove default CPU limit. #7125
* [CHANGE] Ring: relaxed the hash ring heartbeat period and timeout for distributor, ingester, store-gateway and compactor: #6860
  * `-distributor.ring.heartbeat-period` set to `1m`
  * `-distributor.ring.heartbeat-timeout` set to `4m`
  * `-ingester.ring.heartbeat-period` set to `2m`
  * `-ingester.ring.heartbeat-timeout` set to `10m`
  * `-store-gateway.sharding-ring.heartbeat-period` set to `1m`
  * `-store-gateway.sharding-ring.heartbeat-timeout` set to `4m`
  * `-compactor.ring.heartbeat-period` set to `1m`
  * `-compactor.ring.heartbeat-timeout` set to `4m`
* [ENHANCEMENT] Add `jaegerReporterMaxQueueSize` Helm value for all components where configuring `JAEGER_REPORTER_MAX_QUEUE_SIZE` makes sense, and override the Jaeger client's default value of 100 for components expected to generate many trace spans. #7068 #7086
* [ENHANCEMENT] Rollout-operator: upgraded to v0.10.1. #7125
* [ENHANCEMENT] Query-frontend: configured `-shutdown-delay`, `-server.grpc.keepalive.max-connection-age` and termination grace period to reduce the likelihood of queries hitting terminated query-frontends. #7129
11 changes: 11 additions & 0 deletions operations/helm/charts/mimir-distributed/values.yaml
@@ -231,6 +231,13 @@ mimir:
data_dir: "/data"
sharding_ring:
wait_stability_min_duration: 1m
heartbeat_period: 1m
heartbeat_timeout: 4m
distributor:
ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
frontend:
parallelize_shardable_queries: true
@@ -292,6 +299,8 @@ mimir:
num_tokens: 512
tokens_file_path: /data/tokens
unregister_on_shutdown: false
heartbeat_period: 2m
heartbeat_timeout: 10m
{{- if .Values.ingester.zoneAwareReplication.enabled }}
zone_awareness_enabled: true
{{- end }}
@@ -372,6 +381,8 @@ mimir:
store_gateway:
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
wait_stability_min_duration: 1m
{{- if .Values.store_gateway.zoneAwareReplication.enabled }}
kvstore:
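
For deployments that prefer faster detection of crashed instances over reduced KV store load, the new chart defaults could be overridden from a user values file. The sketch below assumes the chart's `mimir.structuredConfig` value is used for the override. The `15s` and `1m` values are illustrative and are assumed to match Mimir's built-in defaults, so check the configuration reference for your version before relying on them.

```yaml
# Hypothetical user values file (for example custom-values.yaml).
# Overrides the relaxed heartbeat settings introduced by this commit.
mimir:
  structuredConfig:
    distributor:
      ring:
        heartbeat_period: 15s   # illustrative value, assumed Mimir default
        heartbeat_timeout: 1m   # illustrative value, assumed Mimir default
    ingester:
      ring:
        heartbeat_period: 15s
        heartbeat_timeout: 1m
    store_gateway:
      sharding_ring:
        heartbeat_period: 15s
        heartbeat_timeout: 1m
    compactor:
      sharding_ring:
        heartbeat_period: 15s
        heartbeat_timeout: 1m
```

Keeping `heartbeat_timeout` several times larger than `heartbeat_period`, as the commit's own values do, leaves room for a few missed heartbeats before an instance is considered unhealthy.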
@@ -91,8 +91,14 @@ data:
max_closing_blocks_concurrency: 2
max_opening_blocks_concurrency: 4
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
wait_stability_min_duration: 1m
symbols_flushers_concurrency: 4
distributor:
ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
frontend:
cache_results: true
grpc_client_config:
@@ -192,6 +198,8 @@ data:
ingester:
ring:
final_sleep: 0s
heartbeat_period: 2m
heartbeat_timeout: 10m
kvstore:
store: memberlist
num_tokens: 512
@@ -291,6 +299,8 @@ data:
key_file: /certs/tls.key
store_gateway:
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
kvstore:
prefix: multi-zone/
tokens_file_path: /data/tokens
@@ -67,8 +67,14 @@ data:
max_closing_blocks_concurrency: 2
max_opening_blocks_concurrency: 4
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
wait_stability_min_duration: 1m
symbols_flushers_concurrency: 4
distributor:
ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
frontend:
parallelize_shardable_queries: true
scheduler_address: gateway-enterprise-values-mimir-query-scheduler-headless.citestns.svc:9095
@@ -99,6 +105,8 @@ data:
ingester:
ring:
final_sleep: 0s
heartbeat_period: 2m
heartbeat_timeout: 10m
num_tokens: 512
tokens_file_path: /data/tokens
unregister_on_shutdown: false
@@ -146,6 +154,8 @@ data:
grpc_server_max_connection_idle: 1m
store_gateway:
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
kvstore:
prefix: multi-zone/
tokens_file_path: /data/tokens
@@ -49,8 +49,14 @@ data:
max_closing_blocks_concurrency: 2
max_opening_blocks_concurrency: 4
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
wait_stability_min_duration: 1m
symbols_flushers_concurrency: 4
distributor:
ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
frontend:
parallelize_shardable_queries: true
scheduler_address: gateway-nginx-values-mimir-query-scheduler-headless.citestns.svc:9095
@@ -61,6 +67,8 @@ data:
ingester:
ring:
final_sleep: 0s
heartbeat_period: 2m
heartbeat_timeout: 10m
num_tokens: 512
tokens_file_path: /data/tokens
unregister_on_shutdown: false
@@ -102,6 +110,8 @@ data:
grpc_server_max_connection_idle: 1m
store_gateway:
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
kvstore:
prefix: multi-zone/
tokens_file_path: /data/tokens
@@ -67,8 +67,14 @@ data:
max_closing_blocks_concurrency: 2
max_opening_blocks_concurrency: 4
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
wait_stability_min_duration: 1m
symbols_flushers_concurrency: 4
distributor:
ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
frontend:
parallelize_shardable_queries: true
scheduler_address: graphite-enabled-values-mimir-query-scheduler-headless.citestns.svc:9095
@@ -123,6 +129,8 @@ data:
ingester:
ring:
final_sleep: 0s
heartbeat_period: 2m
heartbeat_timeout: 10m
num_tokens: 512
tokens_file_path: /data/tokens
unregister_on_shutdown: false
@@ -169,6 +177,8 @@ data:
grpc_server_max_connection_idle: 1m
store_gateway:
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
tokens_file_path: /data/tokens
unregister_on_shutdown: false
wait_stability_min_duration: 1m
@@ -54,8 +54,14 @@ data:
max_closing_blocks_concurrency: 2
max_opening_blocks_concurrency: 4
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
wait_stability_min_duration: 1m
symbols_flushers_concurrency: 4
distributor:
ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
frontend:
cache_results: true
parallelize_shardable_queries: true
@@ -74,6 +80,8 @@ data:
ingester:
ring:
final_sleep: 0s
heartbeat_period: 2m
heartbeat_timeout: 10m
num_tokens: 512
tokens_file_path: /data/tokens
unregister_on_shutdown: false
@@ -113,6 +121,8 @@ data:
grpc_server_max_connection_idle: 1m
store_gateway:
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
kvstore:
prefix: multi-zone/
tokens_file_path: /data/tokens
@@ -49,8 +49,14 @@ data:
max_closing_blocks_concurrency: 2
max_opening_blocks_concurrency: 4
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
wait_stability_min_duration: 1m
symbols_flushers_concurrency: 4
distributor:
ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
frontend:
parallelize_shardable_queries: true
scheduler_address: metamonitoring-values-mimir-query-scheduler-headless.citestns.svc:9095
@@ -61,6 +67,8 @@ data:
ingester:
ring:
final_sleep: 0s
heartbeat_period: 2m
heartbeat_timeout: 10m
num_tokens: 512
tokens_file_path: /data/tokens
unregister_on_shutdown: false
@@ -101,6 +109,8 @@ data:
grpc_server_max_connection_idle: 1m
store_gateway:
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
tokens_file_path: /data/tokens
unregister_on_shutdown: false
wait_stability_min_duration: 1m
@@ -45,8 +45,14 @@ data:
max_closing_blocks_concurrency: 2
max_opening_blocks_concurrency: 4
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
wait_stability_min_duration: 1m
symbols_flushers_concurrency: 4
distributor:
ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
frontend:
parallelize_shardable_queries: true
scheduler_address: openshift-values-mimir-query-scheduler-headless.citestns.svc:9095
@@ -77,6 +83,8 @@ data:
ingester:
ring:
final_sleep: 0s
heartbeat_period: 2m
heartbeat_timeout: 10m
num_tokens: 512
tokens_file_path: /data/tokens
unregister_on_shutdown: false
@@ -116,6 +124,8 @@ data:
grpc_server_max_connection_idle: 1m
store_gateway:
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
kvstore:
prefix: multi-zone/
tokens_file_path: /data/tokens
@@ -34,8 +34,14 @@ data:
max_closing_blocks_concurrency: 2
max_opening_blocks_concurrency: 4
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
wait_stability_min_duration: 1m
symbols_flushers_concurrency: 4
distributor:
ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
frontend:
parallelize_shardable_queries: true
scheduler_address: scheduler-name-values-mimir-query-scheduler-headless.citestns.svc:9095
@@ -46,6 +52,8 @@ data:
ingester:
ring:
final_sleep: 0s
heartbeat_period: 2m
heartbeat_timeout: 10m
num_tokens: 512
tokens_file_path: /data/tokens
unregister_on_shutdown: false
@@ -79,6 +87,8 @@ data:
grpc_server_max_connection_idle: 1m
store_gateway:
sharding_ring:
heartbeat_period: 1m
heartbeat_timeout: 4m
kvstore:
prefix: multi-zone/
tokens_file_path: /data/tokens