Services not unregistered from Consul #17079
Here is a full log of a zombie alloc.
I updated to Nomad 1.5.5, but this bug persists.
But with open issue hashicorp/nomad#17079
OK, I have put in a little effort to get to the bottom of it. I enabled

drain_on_shutdown {
  deadline           = "2m"
  force              = false
  ignore_system_jobs = true
}

But that doesn't make the situation better. My second approach is to call a script via a systemd stop hook.

nomad_node_drain.sh:

#!/bin/bash
if [ ! -f "/home/{{ansible_user}}/notdrain" ]; then
  nomad node drain -enable -self -deadline "2m" -m "Node shutdown" -yes \
    -address=https://localhost:4646 \
    -ca-cert=/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem \
    -client-cert=/etc/opt/certs/nomad/nomad.pem \
    -client-key=/etc/opt/certs/nomad/nomad-key.pem
fi

With that approach I have never seen the dead allocations, but I must start the master node first; otherwise the issue is the same. Draining the node via script comes with the caveat that the node is not eligible after boot. My workaround for making the node eligible again is described here.
@suikast42 still the wrong issue, I think. Please pay attention to where you're posting.
IMHO not. This solves my problem in particular. Which issue are you suggesting?
It is interesting in those logs that we see (reverse chronological)
with nothing following the
@suikast42 any chance you can produce a goroutine dump of the Nomad process when one of them gets into this state? And can you provide as much of a real job spec as you can?
If you tell me how, sure ✌
You can just use For the goroutine dump, just find the
The standard out/err logs of the Nomad client should then contain a whole bunch of goroutine stack trace information. That should at least help us know if a hook is stuck waiting on something.
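A minimal sketch of capturing that (assuming a systemd-managed client; note that the Go runtime's default SIGQUIT handling prints all goroutine stacks to stderr and then exits the process, so expect the agent to restart if systemd supervises it):

# Ask the Go runtime for a goroutine dump of the Nomad client.
sudo kill -QUIT "$(pidof nomad)"
# The stack traces land on the agent's stderr, e.g. in the journal:
sudo journalctl -u nomad --since "-5 min" > nomad-goroutines.txt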
{
"Job": {
"Affinities": null,
"AllAtOnce": false,
"Constraints": null,
"ConsulNamespace": "",
"ConsulToken": "",
"CreateIndex": 97,
"Datacenters": [
"nomadder1"
],
"DispatchIdempotencyToken": "",
"Dispatched": false,
"ID": "observability",
"JobModifyIndex": 97,
"Meta": null,
"Migrate": null,
"ModifyIndex": 404,
"Multiregion": null,
"Name": "observability",
"Namespace": "default",
"NomadTokenID": "",
"ParameterizedJob": null,
"ParentID": "",
"Payload": null,
"Periodic": null,
"Priority": 50,
"Region": "global",
"Reschedule": null,
"Spreads": null,
"Stable": true,
"Status": "running",
"StatusDescription": "",
"Stop": false,
"SubmitTime": 1683559547429931764,
"TaskGroups": [
{
"Affinities": null,
"Constraints": [
{
"LTarget": "${attr.consul.version}",
"Operand": "semver",
"RTarget": ">= 1.7.0"
}
],
"Consul": {
"Namespace": ""
},
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"MaxClientDisconnect": null,
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "grafana",
"Networks": [
{
"CIDR": "",
"DNS": null,
"Device": "",
"DynamicPorts": [
{
"HostNetwork": "default",
"Label": "ui",
"To": 3000,
"Value": 0
},
{
"HostNetwork": "default",
"Label": "connect-proxy-grafana",
"To": -1,
"Value": 0
}
],
"Hostname": "",
"IP": "",
"MBits": 0,
"Mode": "bridge",
"ReservedPorts": null
}
],
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 10000000000,
"DelayFunction": "constant",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"Scaling": null,
"Services": [
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": {
"Grace": 60000000000,
"IgnoreWarnings": false,
"Limit": 3
},
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "health",
"OnUpdate": "require_healthy",
"Path": "/healthz",
"PortLabel": "ui",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 2000000000,
"Type": "http"
}
],
"Connect": {
"Gateway": null,
"Native": false,
"SidecarService": {
"DisableDefaultTCPCheck": false,
"Meta": null,
"Port": "",
"Proxy": null,
"Tags": null
},
"SidecarTask": {
"Config": {
"labels": [
{
"com.github.logunifier.application.pattern.key": "envoy"
}
]
},
"Driver": "",
"Env": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "",
"Resources": {
"CPU": 100,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 300,
"MemoryMaxMB": 0,
"Networks": null
},
"ShutdownDelay": 0,
"User": ""
}
},
"EnableTagOverride": false,
"Meta": null,
"Name": "grafana",
"OnUpdate": "require_healthy",
"PortLabel": "3000",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": [
"traefik.enable=true",
"traefik.consulcatalog.connect=true",
"traefik.http.routers.grafana.tls=true",
"traefik.http.routers.grafana.rule=Host(`grafana.cloud.private`)"
],
"TaskName": ""
}
],
"ShutdownDelay": null,
"Spreads": null,
"StopAfterClientDisconnect": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"ports": [
"ui"
],
"image": "registry.cloud.private/stack/observability/grafana:9.5.1.0",
"labels": [
{
"com.github.logunifier.application.pattern.key": "logfmt",
"com.github.logunifier.application.version": "9.5.1.0",
"com.github.logunifier.application.name": "grafana"
}
]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": {
"GF_AUTH_OAUTH_AUTO_LOGIN": "true",
"GF_PATHS_CONFIG": "/etc/grafana/grafana2.ini",
"GF_PATHS_PLUGINS": "/data/grafana/plugins",
"GF_SERVER_DOMAIN": "grafana.cloud.private",
"GF_SERVER_ROOT_URL": "https://grafana.cloud.private",
"GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH": "contains(realm_access.roles[*], 'admin') && 'GrafanaAdmin' || contains(realm_access.roles[*], 'editor') && 'Editor' || 'Viewer'"
},
"Identity": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Kind": "",
"Leader": false,
"Lifecycle": null,
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "grafana",
"Resources": {
"CPU": 500,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 512,
"MemoryMaxMB": 4096,
"Networks": null
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"ScalingPolicies": null,
"Services": null,
"ShutdownDelay": 0,
"Templates": [
{
"ChangeMode": "restart",
"ChangeScript": null,
"ChangeSignal": "",
"DestPath": "${NOMAD_SECRETS_DIR}/env.vars",
"EmbeddedTmpl": " {{ with nomadVar \"nomad/jobs/observability\" }}\n GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET = {{.keycloak_secret_observability_grafana}}\n {{ end }}\n",
"Envvars": true,
"ErrMissingKey": false,
"Gid": null,
"LeftDelim": "{{",
"Perms": "0644",
"RightDelim": "}}",
"SourcePath": "",
"Splay": 5000000000,
"Uid": null,
"VaultGrace": 0,
"Wait": null
}
],
"User": "",
"Vault": null,
"VolumeMounts": [
{
"Destination": "/var/lib/grafana",
"PropagationMode": "private",
"ReadOnly": false,
"Volume": "stack_observability_grafana_volume"
}
]
},
{
"Affinities": null,
"Artifacts": null,
"Config": {
"labels": [
{
"com.github.logunifier.application.pattern.key": "envoy"
}
],
"image": "${meta.connect.sidecar_image}",
"args": [
"-c",
"${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
"-l",
"${meta.connect.log_level}",
"--concurrency",
"${meta.connect.proxy_concurrency}",
"--disable-hot-restart"
]
},
"Constraints": [
{
"LTarget": "${attr.consul.version}",
"Operand": "semver",
"RTarget": ">= 1.8.0"
},
{
"LTarget": "${attr.consul.grpc}",
"Operand": ">",
"RTarget": "0"
}
],
"DispatchPayload": null,
"Driver": "docker",
"Env": null,
"Identity": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Kind": "connect-proxy:grafana",
"Leader": false,
"Lifecycle": {
"Hook": "prestart",
"Sidecar": true
},
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "connect-proxy-grafana",
"Resources": {
"CPU": 100,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 300,
"MemoryMaxMB": 0,
"Networks": null
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"ScalingPolicies": null,
"Services": null,
"ShutdownDelay": 0,
"Templates": null,
"User": "",
"Vault": null,
"VolumeMounts": null
}
],
"Update": {
"AutoPromote": false,
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"ProgressDeadline": 3600000000000,
"Stagger": 30000000000
},
"Volumes": {
"stack_observability_grafana_volume": {
"AccessMode": "",
"AttachmentMode": "",
"MountOptions": null,
"Name": "stack_observability_grafana_volume",
"PerAlloc": false,
"ReadOnly": false,
"Source": "stack_observability_grafana_volume",
"Type": "host"
}
}
},
{
"Affinities": null,
"Constraints": [
{
"LTarget": "${attr.consul.version}",
"Operand": "semver",
"RTarget": ">= 1.7.0"
}
],
"Consul": {
"Namespace": ""
},
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"MaxClientDisconnect": null,
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "mimir",
"Networks": [
{
"CIDR": "",
"DNS": null,
"Device": "",
"DynamicPorts": null,
"Hostname": "",
"IP": "",
"MBits": 0,
"Mode": "bridge",
"ReservedPorts": [
{
"HostNetwork": "default",
"Label": "api",
"To": 9009,
"Value": 9009
}
]
}
],
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 10000000000,
"DelayFunction": "constant",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"Scaling": null,
"Services": [
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": {
"Grace": 60000000000,
"IgnoreWarnings": false,
"Limit": 3
},
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "health",
"OnUpdate": "require_healthy",
"Path": "/ready",
"PortLabel": "api",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 2000000000,
"Type": "http"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "mimir",
"OnUpdate": "require_healthy",
"PortLabel": "api",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
}
],
"ShutdownDelay": null,
"Spreads": null,
"StopAfterClientDisconnect": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"args": [
"-config.file",
"/config/mimir.yaml",
"-config.expand-env",
"true"
],
"image": "registry.cloud.private/grafana/mimir:2.8.0",
"labels": [
{
"com.github.logunifier.application.version": "2.8.0",
"com.github.logunifier.application.name": "mimir",
"com.github.logunifier.application.pattern.key": "logfmt"
}
],
"ports": [
"api"
],
"volumes": [
"local/mimir.yml:/config/mimir.yaml"
]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": {
"JAEGER_ENDPOINT": "http://tempo-jaeger.service.consul:14268/api/traces?format=jaeger.thrift",
"JAEGER_REPORTER_LOG_SPANS": "true",
"JAEGER_SAMPLER_PARAM": "1",
"JAEGER_SAMPLER_TYPE": "const",
"JAEGER_TRACEID_128BIT": "true"
},
"Identity": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Kind": "",
"Leader": false,
"Lifecycle": null,
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "mimir",
"Resources": {
"CPU": 500,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 512,
"MemoryMaxMB": 32768,
"Networks": null
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"ScalingPolicies": null,
"Services": null,
"ShutdownDelay": 0,
"Templates": [
{
"ChangeMode": "restart",
"ChangeScript": null,
"ChangeSignal": "",
"DestPath": "local/mimir.yml",
"EmbeddedTmpl": "\n# Test ++ env \"NOMAD_ALLOC_NAME\" ++\n# Do not use this configuration in production.\n# It is for demonstration purposes only.\n\n# Run Mimir in single process mode, with all components running in 1 process.\ntarget: all,alertmanager,overrides-exporter\n# Disable tendency support.\nmultitenancy_enabled: false\n\nserver:\n http_listen_port: 9009\n log_level: debug\n # Configure the server to allow messages up to 100MB.\n grpc_server_max_recv_msg_size: 104857600\n grpc_server_max_send_msg_size: 104857600\n grpc_server_max_concurrent_streams: 1000\n\nblocks_storage:\n backend: filesystem\n bucket_store:\n sync_dir: /data/tsdb-sync\n #ignore_blocks_within: 10h # default 10h\n filesystem:\n dir: /data/blocks\n tsdb:\n dir: /data/tsdb\n # Note that changing this requires changes to some other parameters like\n # -querier.query-store-after,\n # -querier.query-ingesters-within and\n # -blocks-storage.bucket-store.ignore-blocks-within.\n # retention_period: 24h # default 24h\nquerier:\n # query_ingesters_within: 13h # default 13h\n #query_store_after: 12h #default 12h\nruler_storage:\n backend: filesystem\n filesystem:\n dir: /data/rules\n\nalertmanager_storage:\n backend: filesystem\n filesystem:\n dir: /data/alarms\n\nfrontend:\n grpc_client_config:\n grpc_compression: snappy\n\nfrontend_worker:\n grpc_client_config:\n grpc_compression: snappy\n\ningester_client:\n grpc_client_config:\n grpc_compression: snappy\n\nquery_scheduler:\n grpc_client_config:\n grpc_compression: snappy\n\nalertmanager:\n data_dir: /data/alertmanager\n# retention: 120h\n sharding_ring:\n replication_factor: 1\n alertmanager_client:\n grpc_compression: snappy\n\nruler:\n query_frontend:\n grpc_client_config:\n grpc_compression: snappy\n\ncompactor:\n# compaction_interval: 1h # default 1h\n# deletion_delay: 12h # default 12h\n max_closing_blocks_concurrency: 2\n max_opening_blocks_concurrency: 4\n symbols_flushers_concurrency: 4\n data_dir: /data/compactor\n sharding_ring:\n kvstore:\n store: memberlist\n\n\ningester:\n ring:\n replication_factor: 1\n\nstore_gateway:\n sharding_ring:\n replication_factor: 1\n\nlimits:\n # Limit queries to 5 years. You can override this on a per-tenant basis.\n max_total_query_length: 43680h\n max_label_names_per_series: 42\n # Allow ingestion of out-of-order samples up to 2 hours since the latest received sample for the series.\n out_of_order_time_window: 1d\n # delete old blocks from long-term storage.\n # Delete from storage metrics data older than 1d.\n compactor_blocks_retention_period: 1d\n ingestion_rate: 100000",
"Envvars": false,
"ErrMissingKey": false,
"Gid": null,
"LeftDelim": "++",
"Perms": "0644",
"RightDelim": "++",
"SourcePath": "",
"Splay": 5000000000,
"Uid": null,
"VaultGrace": 0,
"Wait": null
}
],
"User": "",
"Vault": null,
"VolumeMounts": [
{
"Destination": "/data",
"PropagationMode": "private",
"ReadOnly": false,
"Volume": "stack_observability_mimir_volume"
}
]
}
],
"Update": {
"AutoPromote": false,
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"ProgressDeadline": 3600000000000,
"Stagger": 30000000000
},
"Volumes": {
"stack_observability_mimir_volume": {
"AccessMode": "",
"AttachmentMode": "",
"MountOptions": null,
"Name": "stack_observability_mimir_volume",
"PerAlloc": false,
"ReadOnly": false,
"Source": "stack_observability_mimir_volume",
"Type": "host"
}
}
},
{
"Affinities": null,
"Constraints": [
{
"LTarget": "${attr.consul.version}",
"Operand": "semver",
"RTarget": ">= 1.7.0"
}
],
"Consul": {
"Namespace": ""
},
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"MaxClientDisconnect": null,
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "loki",
"Networks": [
{
"CIDR": "",
"DNS": null,
"Device": "",
"DynamicPorts": null,
"Hostname": "",
"IP": "",
"MBits": 0,
"Mode": "bridge",
"ReservedPorts": [
{
"HostNetwork": "default",
"Label": "http",
"To": 3100,
"Value": 3100
},
{
"HostNetwork": "default",
"Label": "cli",
"To": 7946,
"Value": 7946
},
{
"HostNetwork": "default",
"Label": "grpc",
"To": 9095,
"Value": 9005
}
]
}
],
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 10000000000,
"DelayFunction": "constant",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"Scaling": null,
"Services": [
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": {
"Grace": 60000000000,
"IgnoreWarnings": false,
"Limit": 3
},
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "health",
"OnUpdate": "require_healthy",
"Path": "/ready",
"PortLabel": "http",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 2000000000,
"Type": "http"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "loki",
"OnUpdate": "require_healthy",
"PortLabel": "http",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": [
"prometheus",
"prometheus:server_id=${NOMAD_ALLOC_NAME}",
"prometheus:version=2.9.16"
],
"TaskName": ""
}
],
"ShutdownDelay": null,
"Spreads": null,
"StopAfterClientDisconnect": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"labels": [
{
"com.github.logunifier.application.name": "loki",
"com.github.logunifier.application.pattern.key": "logfmt",
"com.github.logunifier.application.version": "2.8.2"
}
],
"ports": [
"http",
"cli",
"grpc"
],
"volumes": [
"local/loki.yaml:/config/loki.yaml"
],
"args": [
"-config.file",
"/config/loki.yaml",
"-config.expand-env",
"true",
"-print-config-stderr",
"true"
],
"image": "registry.cloud.private/grafana/loki:2.8.2"
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": {
"JAEGER_SAMPLER_PARAM": "1",
"JAEGER_SAMPLER_TYPE": "const",
"JAEGER_TRACEID_128BIT": "true",
"JAEGER_ENDPOINT": "http://tempo-jaeger.service.consul:14268/api/traces?format=jaeger.thrift",
"JAEGER_REPORTER_LOG_SPANS": "true"
},
"Identity": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Kind": "",
"Leader": false,
"Lifecycle": null,
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "loki",
"Resources": {
"CPU": 500,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 512,
"MemoryMaxMB": 32768,
"Networks": null
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"ScalingPolicies": null,
"Services": null,
"ShutdownDelay": 0,
"Templates": [
{
"ChangeMode": "restart",
"ChangeScript": null,
"ChangeSignal": "",
"DestPath": "local/loki.yaml",
"EmbeddedTmpl": "auth_enabled: false\n\nserver:\n #default 3100\n http_listen_port: 3100\n #default 9005\n #grpc_listen_port: 9005\n # Max gRPC message size that can be received\n # CLI flag: -server.grpc-max-recv-msg-size-bytes\n #default 4194304 -> 4MB\n grpc_server_max_recv_msg_size: 419430400\n\n # Max gRPC message size that can be sent\n # CLI flag: -server.grpc-max-send-msg-size-bytes\n #default 4194304 -> 4MB\n grpc_server_max_send_msg_size: 419430400\n\n # Limit on the number of concurrent streams for gRPC calls (0 = unlimited)\n # CLI flag: -server.grpc-max-concurrent-streams\n grpc_server_max_concurrent_streams: 100\n\n # Log only messages with the given severity or above. Supported values [debug,\n # info, warn, error]\n # CLI flag: -log.level\n log_level: \"warn\"\ningester:\n wal:\n enabled: true\n dir: /data/wal\n lifecycler:\n address: 127.0.0.1\n ring:\n kvstore:\n store: memberlist\n replication_factor: 1\n final_sleep: 0s\n chunk_idle_period: 5m\n chunk_retain_period: 30s\n chunk_encoding: snappy\n\nruler:\n evaluation_interval : 1m\n poll_interval: 1m\n storage:\n type: local\n local:\n directory: /data/rules\n rule_path: /data/scratch\n++- range $index, $service := service \"mimir\" -++\n++- if eq $index 0 ++\n alertmanager_url: http://++$service.Name++.service.consul:++ $service.Port ++/alertmanager\n++- end ++\n++- end ++\n\n ring:\n kvstore:\n store: memberlist\n enable_api: true\n enable_alertmanager_v2: true\n\ncompactor:\n working_directory: /data/retention\n shared_store: filesystem\n compaction_interval: 10m\n retention_enabled: true\n retention_delete_delay: 2h\n retention_delete_worker_count: 150\n\nschema_config:\n configs:\n - from: 2023-03-01\n store: boltdb-shipper\n object_store: filesystem\n schema: v12\n index:\n prefix: index_\n period: 24h\n\nstorage_config:\n boltdb_shipper:\n active_index_directory: /data/index\n cache_location: /data/index-cache\n shared_store: filesystem\n filesystem:\n directory: /data/chunks\n index_queries_cache_config:\n enable_fifocache: false\n embedded_cache:\n max_size_mb: 4096\n enabled: true\nquerier:\n multi_tenant_queries_enabled: false\n max_concurrent: 4096\n query_store_only: false\n\nquery_scheduler:\n max_outstanding_requests_per_tenant: 10000\n\nquery_range:\n cache_results: true\n results_cache:\n cache:\n enable_fifocache: false\n embedded_cache:\n enabled: true\n\nchunk_store_config:\n chunk_cache_config:\n enable_fifocache: false\n embedded_cache:\n max_size_mb: 4096\n enabled: true\n write_dedupe_cache_config:\n enable_fifocache: false\n embedded_cache:\n max_size_mb: 4096\n enabled: true\n\ndistributor:\n ring:\n kvstore:\n store: memberlist\n\ntable_manager:\n retention_deletes_enabled: true\n retention_period: 24h\n\nlimits_config:\n ingestion_rate_mb: 64\n ingestion_burst_size_mb: 8\n max_label_name_length: 4096\n max_label_value_length: 8092\n enforce_metric_name: false\n # Loki will reject any log lines that have already been processed and will not index them again\n reject_old_samples: false\n # 5y\n reject_old_samples_max_age: 43800h\n # The limit to length of chunk store queries. 
0 to disable.\n # 5y\n max_query_length: 43800h\n # Maximum number of log entries that will be returned for a query.\n max_entries_limit_per_query: 20000\n # Limit the maximum of unique series that is returned by a metric query.\n max_query_series: 100000\n # Maximum number of queries that will be scheduled in parallel by the frontend.\n max_query_parallelism: 64\n split_queries_by_interval: 24h\n # Alter the log line timestamp during ingestion when the timestamp is the same as the\n # previous entry for the same stream. When enabled, if a log line in a push request has\n # the same timestamp as the previous line for the same stream, one nanosecond is added\n # to the log line. This will preserve the received order of log lines with the exact\n # same timestamp when they are queried, by slightly altering their stored timestamp.\n # NOTE: This is imperfect, because Loki accepts out of order writes, and another push\n # request for the same stream could contain duplicate timestamps to existing\n # entries and they will not be incremented.\n # CLI flag: -validation.increment-duplicate-timestamps\n increment_duplicate_timestamp: true\n #Log data retention for all\n retention_period: 24h\n # Comment this out for fine grained retention\n# retention_stream:\n# - selector: '{namespace=\"dev\"}'\n# priority: 1\n# period: 24h\n # Comment this out for having overrides\n# per_tenant_override_config: /etc/overrides.yaml",
"Envvars": false,
"ErrMissingKey": false,
"Gid": null,
"LeftDelim": "++",
"Perms": "0644",
"RightDelim": "++",
"SourcePath": "",
"Splay": 5000000000,
"Uid": null,
"VaultGrace": 0,
"Wait": null
}
],
"User": "",
"Vault": null,
"VolumeMounts": [
{
"Destination": "/data",
"PropagationMode": "private",
"ReadOnly": false,
"Volume": "stack_observability_loki_volume"
}
]
}
],
"Update": {
"AutoPromote": false,
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"ProgressDeadline": 3600000000000,
"Stagger": 30000000000
},
"Volumes": {
"stack_observability_loki_volume": {
"AccessMode": "",
"AttachmentMode": "",
"MountOptions": null,
"Name": "stack_observability_loki_volume",
"PerAlloc": false,
"ReadOnly": false,
"Source": "stack_observability_loki_volume",
"Type": "host"
}
}
},
{
"Affinities": null,
"Constraints": [
{
"LTarget": "${attr.consul.version}",
"Operand": "semver",
"RTarget": ">= 1.7.0"
}
],
"Consul": {
"Namespace": ""
},
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"MaxClientDisconnect": null,
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "tempo",
"Networks": [
{
"CIDR": "",
"DNS": null,
"Device": "",
"DynamicPorts": null,
"Hostname": "",
"IP": "",
"MBits": 0,
"Mode": "bridge",
"ReservedPorts": [
{
"HostNetwork": "default",
"Label": "jaeger",
"To": 14268,
"Value": 14268
},
{
"HostNetwork": "default",
"Label": "tempo",
"To": 3200,
"Value": 3200
},
{
"HostNetwork": "default",
"Label": "otlp_grpc",
"To": 4317,
"Value": 4317
},
{
"HostNetwork": "default",
"Label": "otlp_http",
"To": 4318,
"Value": 4318
},
{
"HostNetwork": "default",
"Label": "zipkin",
"To": 9411,
"Value": 9411
}
]
}
],
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 10000000000,
"DelayFunction": "constant",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"Scaling": null,
"Services": [
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": {
"Grace": 60000000000,
"IgnoreWarnings": false,
"Limit": 3
},
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "health",
"OnUpdate": "require_healthy",
"Path": "/ready",
"PortLabel": "tempo",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 2000000000,
"Type": "http"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "tempo",
"OnUpdate": "require_healthy",
"PortLabel": "tempo",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
},
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": null,
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "tempo_zipkin_check",
"OnUpdate": "require_healthy",
"Path": "",
"PortLabel": "",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 1000000000,
"Type": "tcp"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "tempo-zipkin",
"OnUpdate": "require_healthy",
"PortLabel": "zipkin",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
},
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": null,
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "tempo_jaeger_check",
"OnUpdate": "require_healthy",
"Path": "",
"PortLabel": "",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 1000000000,
"Type": "tcp"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "tempo-jaeger",
"OnUpdate": "require_healthy",
"PortLabel": "jaeger",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
},
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": null,
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "tempo_otlp_grpc_check",
"OnUpdate": "require_healthy",
"Path": "",
"PortLabel": "",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 1000000000,
"Type": "tcp"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "tempo-otlp-grpc",
"OnUpdate": "require_healthy",
"PortLabel": "otlp_grpc",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
},
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": null,
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "tempo_otlp_http_check",
"OnUpdate": "require_healthy",
"Path": "",
"PortLabel": "",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 1000000000,
"Type": "tcp"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "tempo-otlp-http",
"OnUpdate": "require_healthy",
"PortLabel": "otlp_http",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
}
],
"ShutdownDelay": null,
"Spreads": null,
"StopAfterClientDisconnect": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"args": [
"-config.file",
"/config/tempo.yaml",
"-config.expand-env",
"true"
],
"image": "registry.cloud.private/grafana/tempo:2.1.1",
"labels": [
{
"com.github.logunifier.application.version": "2.1.1",
"com.github.logunifier.application.name": "tempo",
"com.github.logunifier.application.pattern.key": "logfmt"
}
],
"ports": [
"jaeger",
"tempo",
"otlp_grpc",
"otlp_http",
"zipkin"
],
"volumes": [
"local/tempo.yaml:/config/tempo.yaml"
]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": {
"JAEGER_SAMPLER_TYPE": "const",
"JAEGER_TRACEID_128BIT": "true",
"JAEGER_REPORTER_LOG_SPANS": "true",
"JAEGER_SAMPLER_PARAM": "1"
},
"Identity": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Kind": "",
"Leader": false,
"Lifecycle": null,
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "tempo",
"Resources": {
"CPU": 500,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 512,
"MemoryMaxMB": 32768,
"Networks": null
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"ScalingPolicies": null,
"Services": null,
"ShutdownDelay": 0,
"Templates": [
{
"ChangeMode": "restart",
"ChangeScript": null,
"ChangeSignal": "",
"DestPath": "local/tempo.yaml",
"EmbeddedTmpl": "multitenancy_enabled: false\n\nserver:\n http_listen_port: 3200\n\ndistributor:\n receivers: # this configuration will listen on all ports and protocols that tempo is capable of.\n jaeger: # the receives all come from the OpenTelemetry collector. more configuration information can\n protocols: # be found there: https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver\n thrift_http: #\n grpc: # for a production deployment you should only enable the receivers you need!\n thrift_binary:\n thrift_compact:\n zipkin:\n otlp:\n protocols:\n http:\n grpc:\n opencensus:\n\ningester:\n trace_idle_period: 10s # the length of time after a trace has not received spans to consider it complete and flush it\n max_block_bytes: 1_000_000 # cut the head block when it hits this size or ...\n max_block_duration: 5m # this much time passes\n\ncompactor:\n compaction:\n compaction_window: 1h # blocks in this time window will be compacted together\n max_block_bytes: 100_000_000 # maximum size of compacted blocks\n block_retention: 24h # Duration to keep blocks 1d\n\nmetrics_generator:\n registry:\n external_labels:\n source: tempo\n cluster: nomadder1\n storage:\n path: /data/generator/wal\n remote_write:\n++- range service \"mimir\" ++\n - url: http://++.Name++.service.consul:++.Port++/api/v1/push\n send_exemplars: true\n headers:\n x-scope-orgid: 1\n++- end ++\n\nstorage:\n trace:\n backend: local # backend configuration to use\n block:\n bloom_filter_false_positive: .05 # bloom filter false positive rate. lower values create larger filters but fewer false positives\n wal:\n path: /data/wal # where to store the the wal locally\n local:\n path: /data/blocks\n pool:\n max_workers: 100 # worker pool determines the number of parallel requests to the object store backend\n queue_depth: 10000\n\nquery_frontend:\n search:\n # how to define year here ? define 5 years\n max_duration: 43800h\n\noverrides:\n metrics_generator_processors: [service-graphs, span-metrics]",
"Envvars": false,
"ErrMissingKey": false,
"Gid": null,
"LeftDelim": "++",
"Perms": "0644",
"RightDelim": "++",
"SourcePath": "",
"Splay": 5000000000,
"Uid": null,
"VaultGrace": 0,
"Wait": null
}
],
"User": "",
"Vault": null,
"VolumeMounts": [
{
"Destination": "/data",
"PropagationMode": "private",
"ReadOnly": false,
"Volume": "stack_observability_tempo_volume"
}
]
}
],
"Update": {
"AutoPromote": false,
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"ProgressDeadline": 3600000000000,
"Stagger": 30000000000
},
"Volumes": {
"stack_observability_tempo_volume": {
"AccessMode": "",
"AttachmentMode": "",
"MountOptions": null,
"Name": "stack_observability_tempo_volume",
"PerAlloc": false,
"ReadOnly": false,
"Source": "stack_observability_tempo_volume",
"Type": "host"
}
}
},
{
"Affinities": null,
"Constraints": [
{
"LTarget": "${attr.consul.version}",
"Operand": "semver",
"RTarget": ">= 1.7.0"
}
],
"Consul": {
"Namespace": ""
},
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"MaxClientDisconnect": null,
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "nats",
"Networks": [
{
"CIDR": "",
"DNS": null,
"Device": "",
"DynamicPorts": [
{
"HostNetwork": "default",
"Label": "http",
"To": 8222,
"Value": 0
},
{
"HostNetwork": "default",
"Label": "cluster",
"To": 6222,
"Value": 0
},
{
"HostNetwork": "default",
"Label": "prometheus-exporter",
"To": 7777,
"Value": 0
}
],
"Hostname": "",
"IP": "",
"MBits": 0,
"Mode": "bridge",
"ReservedPorts": [
{
"HostNetwork": "default",
"Label": "client",
"To": 4222,
"Value": 4222
}
]
}
],
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 10000000000,
"DelayFunction": "constant",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"Scaling": null,
"Services": [
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": {
"Grace": 60000000000,
"IgnoreWarnings": false,
"Limit": 3
},
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "service: \"nats\" check",
"OnUpdate": "require_healthy",
"Path": "/healthz",
"PortLabel": "http",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 2000000000,
"Type": "http"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "nats",
"OnUpdate": "require_healthy",
"PortLabel": "client",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
},
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": {
"Grace": 60000000000,
"IgnoreWarnings": false,
"Limit": 3
},
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 5000000000,
"Method": "",
"Name": "service: \"nats-prometheus-exporter\" check",
"OnUpdate": "require_healthy",
"Path": "/metrics",
"PortLabel": "prometheus-exporter",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 2000000000,
"Type": "http"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "nats-prometheus-exporter",
"OnUpdate": "require_healthy",
"PortLabel": "prometheus-exporter",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": [
"prometheus",
"prometheus:server_id=${NOMAD_ALLOC_NAME}",
"prometheus:version=2.9.16"
],
"TaskName": ""
}
],
"ShutdownDelay": null,
"Spreads": null,
"StopAfterClientDisconnect": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"args": [
"-varz",
"-channelz",
"-connz",
"-gatewayz",
"-leafz",
"-serverz",
"-subz",
"-jsz=all",
"-use_internal_server_id",
"http://localhost:${NOMAD_PORT_http}"
],
"image": "registry.cloud.private/natsio/prometheus-nats-exporter:0.11.0",
"labels": [
{
"com.github.logunifier.application.name": "prometheus-nats-exporter",
"com.github.logunifier.application.pattern.key": "tslevelmsg",
"com.github.logunifier.application.version": "0.11.0.0"
}
],
"ports": [
"prometheus_exporter"
]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": null,
"Identity": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Kind": "",
"Leader": false,
"Lifecycle": {
"Hook": "poststart",
"Sidecar": true
},
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "nats-prometheus-exporter",
"Resources": {
"CPU": 100,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 300,
"MemoryMaxMB": 0,
"Networks": null
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"ScalingPolicies": null,
"Services": null,
"ShutdownDelay": 0,
"Templates": null,
"User": "",
"Vault": null,
"VolumeMounts": null
},
{
"Affinities": null,
"Artifacts": null,
"Config": {
"ports": [
"client",
"http",
"cluster"
],
"volumes": [
"local/nats.conf:/config/nats.conf"
],
"args": [
"-c",
"/config/nats.conf",
"-js"
],
"image": "registry.cloud.private/nats:2.9.16-alpine",
"labels": [
{
"com.github.logunifier.application.pattern.key": "tslevelmsg",
"com.github.logunifier.application.version": "2.9.16",
"com.github.logunifier.application.name": "nats"
}
]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": null,
"Identity": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Kind": "",
"Leader": false,
"Lifecycle": null,
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "nats",
"Resources": {
"CPU": 500,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 512,
"MemoryMaxMB": 32768,
"Networks": null
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"ScalingPolicies": null,
"Services": null,
"ShutdownDelay": 0,
"Templates": [
{
"ChangeMode": "restart",
"ChangeScript": null,
"ChangeSignal": "",
"DestPath": "local/nats.conf",
"EmbeddedTmpl": "# Client port of ++ env \"NOMAD_PORT_client\" ++ on all interfaces\nport: ++ env \"NOMAD_PORT_client\" ++\n\n# HTTP monitoring port\nmonitor_port: ++ env \"NOMAD_PORT_http\" ++\nserver_name: \"++ env \"NOMAD_ALLOC_NAME\" ++\"\n#If true enable protocol trace log messages. Excludes the system account.\ntrace: false\n#If true enable protocol trace log messages. Includes the system account.\ntrace_verbose: false\n#if true enable debug log messages\ndebug: false\nhttp_port: ++ env \"NOMAD_PORT_http\" ++\n#http: nats.service.consul:++ env \"NOMAD_PORT_http\" ++\n\njetstream {\n store_dir: /data/jetstream\n\n # 1GB\n max_memory_store: 2G\n\n # 10GB\n max_file_store: 10G\n}\n",
"Envvars": false,
"ErrMissingKey": false,
"Gid": null,
"LeftDelim": "++",
"Perms": "0644",
"RightDelim": "++",
"SourcePath": "",
"Splay": 5000000000,
"Uid": null,
"VaultGrace": 0,
"Wait": null
}
],
"User": "",
"Vault": null,
"VolumeMounts": [
{
"Destination": "/data/jetstream",
"PropagationMode": "private",
"ReadOnly": false,
"Volume": "stack_observability_nats_volume"
}
]
}
],
"Update": {
"AutoPromote": false,
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"ProgressDeadline": 3600000000000,
"Stagger": 30000000000
},
"Volumes": {
"stack_observability_nats_volume": {
"AccessMode": "",
"AttachmentMode": "",
"MountOptions": null,
"Name": "stack_observability_nats_volume",
"PerAlloc": false,
"ReadOnly": false,
"Source": "stack_observability_nats_volume",
"Type": "host"
}
}
},
{
"Affinities": null,
"Constraints": [
{
"LTarget": "${attr.consul.version}",
"Operand": "semver",
"RTarget": ">= 1.7.0"
}
],
"Consul": {
"Namespace": ""
},
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"MaxClientDisconnect": null,
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "grafana-agent",
"Networks": [
{
"CIDR": "",
"DNS": null,
"Device": "",
"DynamicPorts": [
{
"HostNetwork": "default",
"Label": "server",
"To": 0,
"Value": 0
}
],
"Hostname": "",
"IP": "",
"MBits": 0,
"Mode": "bridge",
"ReservedPorts": null
}
],
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 10000000000,
"DelayFunction": "constant",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"Scaling": null,
"Services": [
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": {
"Grace": 60000000000,
"IgnoreWarnings": false,
"Limit": 3
},
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "service: \"grafana-agent-health\" check",
"OnUpdate": "require_healthy",
"Path": "/-/healthy",
"PortLabel": "server",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 2000000000,
"Type": "http"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "grafana-agent-health",
"OnUpdate": "require_healthy",
"PortLabel": "server",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
},
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": {
"Grace": 60000000000,
"IgnoreWarnings": false,
"Limit": 5
},
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "service: \"grafana-agent-ready\" check",
"OnUpdate": "require_healthy",
"Path": "/-/ready",
"PortLabel": "server",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 2000000000,
"Type": "http"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "grafana-agent-ready",
"OnUpdate": "require_healthy",
"PortLabel": "server",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
}
],
"ShutdownDelay": null,
"Spreads": null,
"StopAfterClientDisconnect": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"labels": [
{
"com.github.logunifier.application.version": "0.33.1",
"com.github.logunifier.application.name": "grafana_agent",
"com.github.logunifier.application.pattern.key": "logfmt"
}
],
"volumes": [
"local/agent.yaml:/config/agent.yaml"
],
"args": [
"-config.file",
"/config/agent.yaml",
"-server.http.address",
":${NOMAD_HOST_PORT_server}"
],
"image": "registry.cloud.private/grafana/agent:v0.33.1"
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": null,
"Identity": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Kind": "",
"Leader": false,
"Lifecycle": null,
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "grafana-agent",
"Resources": {
"CPU": 100,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 64,
"MemoryMaxMB": 2048,
"Networks": null
},
"RestartPolicy": {
"Attempts": 1,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"ScalingPolicies": null,
"Services": null,
"ShutdownDelay": 0,
"Templates": [
{
"ChangeMode": "restart",
"ChangeScript": null,
"ChangeSignal": "",
"DestPath": "local/agent.yaml",
"EmbeddedTmpl": "server:\n log_level: info\n\nmetrics:\n wal_directory: \"/data/wal\"\n global:\n scrape_interval: 5s\n remote_write:\n++- range service \"mimir\" ++\n - url: http://++.Name++.service.consul:++.Port++/api/v1/push\n++- end ++\n configs:\n - name: integrations\n scrape_configs:\n - job_name: integrations/traefik\n scheme: http\n metrics_path: '/metrics'\n static_configs:\n - targets:\n
- ingress.cloud.private:8081\n # grab all metric endpoints with stadanrd /metrics endpoint\n - job_name: \"integrations/consul_sd\"\n consul_sd_configs:\n - server: \"consul.service.consul:8501\"\n tags: [\"prometheus\"]\n tls_config:\n insecure_skip_verify: true\n ca_file: \"/certs/ca/ca.crt\"\n cert_file: \"/certs/consul/consul.pem\"\n key_file: \"/certs/consul/consul-key.pem\"\n datacenter: \"nomadder1\"\n scheme: https\n relabel_configs:\n - source_labels: [__meta_consul_node]\n target_label: instance\n - source_labels: [__meta_consul_service]\n target_label: service\n# - source_labels: [__meta_consul_tags]\n# separator: ','\n# regex: 'prometheus:([^=]+)=([^,]+)'\n# target_label: '$${1}'\n# replacement: '$${2}'\n
- source_labels: [__meta_consul_tags]\n separator: ','\n regex: '.*,prometheus:server_id=([^,]+),.*'\n target_label: 'server_id'\n replacement: '$${1}'\n - source_labels: [__meta_consul_tags]\n separator: ','\n regex: '.*,prometheus:version=([^,]+),.*'\n target_label: 'version'\n replacement: '$${1}'\n - source_labels: ['__meta_consul_tags']\n target_label: 'labels'\n regex: '(.+)'\n replacement: '$${1}'\n action: 'keep'\n # - action: replace\n # replacement: '1'\n #
target_label: 'test'\n metric_relabel_configs:\n - action: labeldrop\n
regex: 'exported_.*'\n\n\n - job_name: \"integrations/consul_sd_minio\"\n metrics_path: \"/minio/v2/metrics/cluster\"\n consul_sd_configs:\n - server: \"consul.service.consul:8501\"\n tags: [\"prometheus_minio\"]\n tls_config:\n
insecure_skip_verify: true\n ca_file: \"/certs/ca/ca.crt\"\n cert_file: \"/certs/consul/consul.pem\"\n key_file: \"/certs/consul/consul-key.pem\"\n
datacenter: \"nomadder1\"\n scheme: https\n relabel_configs:\n - source_labels: [__meta_consul_node]\n target_label: instance\n - source_labels: [__meta_consul_service]\n target_label: service\n# - source_labels: [__meta_consul_tags]\n# separator: ','\n# regex: 'prometheus:([^=]+)=([^,]+)'\n# target_label: '$${1}'\n# replacement: '$${2}'\n - source_labels: [__meta_consul_tags]\n separator: ','\n regex: '.*,prometheus:server=([^,]+),.*'\n target_label: 'server'\n replacement: '$${1}'\n - source_labels: [__meta_consul_tags]\n separator: ','\n
regex: '.*,prometheus:version=([^,]+),.*'\n target_label: 'version'\n replacement: '$${1}'\n - source_labels: ['__meta_consul_tags']\n target_label: 'labels'\n regex: '(.+)'\n replacement: '$${1}'\n action: 'keep'\n# - action: replace\n# replacement: '38'\n# target_label: 'test'\n metric_relabel_configs:\n - action: labeldrop\n regex: 'exported_.*'",
"Envvars": false,
"ErrMissingKey": false,
"Gid": null,
"LeftDelim": "++",
"Perms": "0644",
"RightDelim": "++",
"SourcePath": "",
"Splay": 5000000000,
"Uid": null,
"VaultGrace": 0,
"Wait": null
}
],
"User": "",
"Vault": null,
"VolumeMounts": [
{
"Destination": "/data/wal",
"PropagationMode": "private",
"ReadOnly": false,
"Volume": "stack_observability_grafana_agent_volume"
},
{
"Destination": "/certs/ca",
"PropagationMode": "private",
"ReadOnly": false,
"Volume": "ca_certs"
},
{
"Destination": "/certs/consul",
"PropagationMode": "private",
"ReadOnly": false,
"Volume": "cert_consul"
}
]
}
],
"Update": {
"AutoPromote": false,
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"ProgressDeadline": 3600000000000,
"Stagger": 30000000000
},
"Volumes": {
"cert_consul": {
"AccessMode": "",
"AttachmentMode": "",
"MountOptions": null,
"Name": "cert_consul",
"PerAlloc": false,
"ReadOnly": true,
"Source": "cert_consul",
"Type": "host"
},
"stack_observability_grafana_agent_volume": {
"AccessMode": "",
"AttachmentMode": "",
"MountOptions": null,
"Name": "stack_observability_grafana_agent_volume",
"PerAlloc": false,
"ReadOnly": false,
"Source": "stack_observability_grafana_agent_volume",
"Type": "host"
},
"ca_certs": {
"AccessMode": "",
"AttachmentMode": "",
"MountOptions": null,
"Name": "ca_certs",
"PerAlloc": false,
"ReadOnly": true,
"Source": "ca_cert",
"Type": "host"
}
}
},
{
"Affinities": null,
"Constraints": [
{
"LTarget": "${attr.consul.version}",
"Operand": "semver",
"RTarget": ">= 1.7.0"
}
],
"Consul": {
"Namespace": ""
},
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"MaxClientDisconnect": null,
"Meta": null,
"Migrate": {
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000
},
"Name": "logunifier",
"Networks": [
{
"CIDR": "",
"DNS": null,
"Device": "",
"DynamicPorts": [
{
"HostNetwork": "default",
"Label": "health",
"To": 3000,
"Value": 0
}
],
"Hostname": "",
"IP": "",
"MBits": 0,
"Mode": "bridge",
"ReservedPorts": null
}
],
"ReschedulePolicy": {
"Attempts": 0,
"Delay": 10000000000,
"DelayFunction": "constant",
"Interval": 0,
"MaxDelay": 3600000000000,
"Unlimited": true
},
"RestartPolicy": {
"Attempts": 3,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"Scaling": null,
"Services": [
{
"Address": "",
"AddressMode": "auto",
"CanaryMeta": null,
"CanaryTags": null,
"CheckRestart": null,
"Checks": [
{
"AddressMode": "",
"Advertise": "",
"Args": null,
"Body": "",
"CheckRestart": {
"Grace": 60000000000,
"IgnoreWarnings": false,
"Limit": 3
},
"Command": "",
"Expose": false,
"FailuresBeforeCritical": 0,
"GRPCService": "",
"GRPCUseTLS": false,
"Header": null,
"InitialStatus": "",
"Interval": 10000000000,
"Method": "",
"Name": "service: \"logunifier-health\" check",
"OnUpdate": "require_healthy",
"Path": "/health",
"PortLabel": "health",
"Protocol": "",
"SuccessBeforePassing": 0,
"TLSSkipVerify": false,
"TaskName": "",
"Timeout": 2000000000,
"Type": "http"
}
],
"Connect": null,
"EnableTagOverride": false,
"Meta": null,
"Name": "logunifier-health",
"OnUpdate": "require_healthy",
"PortLabel": "health",
"Provider": "consul",
"TaggedAddresses": null,
"Tags": null,
"TaskName": ""
}
],
"ShutdownDelay": null,
"Spreads": null,
"StopAfterClientDisconnect": null,
"Tasks": [
{
"Affinities": null,
"Artifacts": null,
"Config": {
"labels": [
{
"com.github.logunifier.application.name": "logunifier",
"com.github.logunifier.application.pattern.key": "tslevelmsg",
"com.github.logunifier.application.strip.ansi": "true",
"com.github.logunifier.application.version": "0.1.1"
}
],
"ports": [
"health"
],
"args": [
"-loglevel",
"debug",
"-natsServers",
"nats.service.consul:4222",
"-lokiServers",
"loki.service.consul:9005"
],
"image": "registry.cloud.private/suikast42/logunifier:0.1.1"
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": null,
"Identity": null,
"KillSignal": "",
"KillTimeout": 5000000000,
"Kind": "",
"Leader": false,
"Lifecycle": null,
"LogConfig": {
"Disabled": false,
"Enabled": null,
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "logunifier",
"Resources": {
"CPU": 100,
"Cores": 0,
"Devices": null,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 64,
"MemoryMaxMB": 2048,
"Networks": null
},
"RestartPolicy": {
"Attempts": 3,
"Delay": 5000000000,
"Interval": 3600000000000,
"Mode": "fail"
},
"ScalingPolicies": null,
"Services": null,
"ShutdownDelay": 0,
"Templates": null,
"User": "",
"Vault": null,
"VolumeMounts": null
}
],
"Update": {
"AutoPromote": false,
"AutoRevert": true,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"ProgressDeadline": 3600000000000,
"Stagger": 30000000000
},
"Volumes": null
}
],
"Type": "service",
"Update": {
"AutoPromote": false,
"AutoRevert": false,
"Canary": 0,
"HealthCheck": "",
"HealthyDeadline": 0,
"MaxParallel": 1,
"MinHealthyTime": 0,
"ProgressDeadline": 0,
"Stagger": 30000000000
},
"VaultNamespace": "",
"VaultToken": "",
"Version": 0
}
}
And here is the log:
@suikast42 After upgrading to 1.5.5 and restarting all nodes, I can see that zombie services get cleared from Consul. After running my ZooKeeper job for a while, though, the problem seems to come back. Like you, I also see a lot of services that never get deregistered, even after the job is stopped. There are actually only three ZK instances, but there are 60++ instances listed. This results in DNS service discovery handing out IP addresses of instances that no longer exist, causing services to fail.
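For anyone trying to confirm the same thing, a quick way to compare Consul's view against Nomad's is something like the following (the service and job names are placeholders for the ZooKeeper setup described above):

# What Consul thinks exists, including health status:
curl -s http://127.0.0.1:8500/v1/health/service/zookeeper | jq '.[].Service | {ID, Address, Port}'
# What Nomad thinks is actually running:
nomad job status zookeeper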
FYI, I have the same issue. It seems there is some bug where Nomad does not inform Consul to deregister. Very frustrating.
My observation is that the start and stop order of the cluster is important. Shutdown
Boot
If I don't respect this order, then the "zombies" come back.
@suikast42 (or others), any chance you can turn on TRACE level logging on the Nomad client, and send those? @gulducat and I have each spent a few hours trying to reproduce the symptoms here, but neither of us has been able to. The extra verbose logging may help us understand what we need for a reproduction.
I'm also seeing this new behavior in 1.5.5 after upgrading from 1.4.5. Lots of dead zombie registrations left in Consul.
For sure. EDIT:
|
Can confirm the same |
For me at least, stopping the affected jobs in Nomad does not deregister the zombie instances from Consul; only the instances Nomad lists within the job's allocations are deregistered.
What about draining the node, restarting Nomad, and making the node eligible again?
The pattern I'm seeing for zombie instances in Consul is a task failing its health check, going into "fail" mode, and Nomad reallocating the task group. The failed tasks are never culled from Consul. Important to note, this doesn't happen 100% of the time. Some additional info: after upgrading to Consul 1.14.x, I had to add this config to Nomad clients to get Connect sidecars working again, otherwise the Connect sidecars would fail to communicate with Consul over gRPC (probably unrelated, but thought I would mention it anyway):

consul {
  grpc_address = "127.0.0.1:8503"
  grpc_ca_file = "/opt/consul/tls/ca-cert.pem"
  # ...
}
There are some changes in gRPC communication. I am at Consul 1.15.2. This issue does not relate to Consul; I think the issue is on the Nomad side. You can delete all passing services from Consul, but Nomad registers the zombies again.
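For completeness, the manual cleanup looks roughly like this; as noted above, Nomad just puts the zombies back, so it is only a temporary workaround (the service ID below is a made-up example):

# List the service IDs registered on the local Consul agent ...
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys'
# ... and deregister the stale one on the agent that owns it:
curl -s -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/_nomad-task-<alloc-id>-group-grafana-grafana-ui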
Let me also add my context: I don't use any special settings for Consul, and I don't use Consul Connect either, but I do have Consul ACLs enabled. I am using Consul 1.15.1. It only happens to my ZooKeeper jobs; all of my other jobs remain stable. My ZooKeeper jobs use TCP and script health checks. These jobs seem to fail often, I don't know why. It seems that some health checks get missed and allocations get reallocated, so the bug could probably be found near that logic.
This PR fixes a bug where issuing a restart to a terminal allocation would cause the allocation to run its hooks anyway. This was particularly apparent with group_service_hook, which would register services but never deregister them - as the allocation would effectively be in a "zombie" state where it is prepped to run tasks but never will. Fixes #17079 Fixes #16238 Fixes #14618
@Artanicus at the moment the most helpful thing would be for folks to build and run this branch of Nomad as their Client on an affected node, to see if it helps. No need to update servers. This is the 1.5.5 release but with one extra commit (279061f). The full contributing guide has the details, but basically building Nomad is:
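The exact commands did not survive above; roughly, and assuming a recent Go toolchain plus the repository's standard Makefile targets, it is something like:

git clone https://github.com/hashicorp/nomad.git
cd nomad
git checkout <branch-name>   # placeholder for the branch linked above

# install build dependencies, then compile a dev binary into ./bin
make bootstrap
make dev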
|
Hello! 👋 We've been seeing this issue as well. It is consistently reproducible by doing the following:
A reliable way to clean up Consul is by restarting the Nomad agent with the rogue service.
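A hedged way to confirm whether the rogue entry is gone after the restart (assuming the agent's default HTTP port):

# service IDs the local agent still has registered
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys'

# or the cluster-wide view
consul catalog services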
|
ACK, doing some ambient testing now with a couple of nodes. Doing a

Also did some active tests with a slightly modified job that doesn't require raw-exec. I unfortunately am not seeing it having fixed the issue, with the caveat that it repeats for me just from the job's failure / its own retry scheduling logic, without any manual restarts of failed allocs needed.

repro job: http.nomad

job "http" {
type = "service"
group "group" {
network {
mode = "host"
port "http" {}
}
count = 5
spread {
attribute = "${node.unique.name}"
}
volume "httproot" {
type = "host"
source = "repro"
read_only = "true"
}
service {
name = "myhttp"
port = "http"
check {
name = "c1"
type = "http"
port = "http"
path = "/hi.txt"
interval = "5s"
timeout = "1s"
check_restart {
limit = 3
grace = "30s"
}
}
}
restart {
attempts = 1
delay = "10s"
mode = "fail"
}
task "python" {
driver = "exec"
volume_mount {
volume = "httproot"
destination = "/srv"
read_only = true
}
config {
command = "python3"
args = ["-m", "http.server", "${NOMAD_PORT_http}", "--directory", "/srv"]
}
resources {
cpu = 10
memory = 64
}
}
}
}
Once I got the allocs running smoothly (more on that further down), here's some tests:

case 1: Making one node with 1.5.5 (
|
Thank you @Artanicus, that is very helpful. We might ask you to run another binary in the near future, with a lot more trace logging statements, if we still can't figure out what's going on. I've genuinely put 20 or so hours into trying to create a reproduction beyond the

If you can, could you try to reproduce again with the simplest job possible (count = 1, constraint on the node with a
Here's an attempt at a more hermetic and simple repro with a standalone Vagrant dev agent.
Vagrantfile

Vagrant.configure("2") do |config|
  config.vm.box = "generic/debian11"
  config.vm.network "private_network", ip: "192.168.56.10"
  config.vm.synced_folder ".", "/local"
  config.vm.provision "shell", inline: <<-SHELL
    install /local/nomad /usr/bin/nomad
  SHELL
end

Version details:
http.nomadjob "http" {
type = "service"
group "group" {
network {
mode = "host"
port "http" {}
}
count = 1
service {
name = "myhttp"
port = "http"
provider = "nomad"
check {
name = "c1"
type = "http"
port = "http"
path = "/hi.txt"
interval = "5s"
timeout = "1s"
check_restart {
limit = 3
grace = "30s"
}
}
}
restart {
attempts = 1
delay = "10s"
mode = "fail"
}
task "python" {
driver = "raw_exec"
config {
command = "python3"
args = ["-m", "http.server", "${NOMAD_PORT_http}", "--directory", "/tmp"]
}
resources {
cpu = 10
memory = 64
}
}
}
} At first I let it run for a good while without
|
That's true if you use dynamic port binding. If you do the health check on a static port, then the situation is the worst-case scenario: the health check is OK, but the LB delegates to a dead alloc. |
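"Static port" here meaning a fixed host port in the group's network block, something like this sketch:

network {
  port "ui" {
    static = 3000  # fixed host port: a later healthy alloc on the same port can mask the zombie entry
  }
}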
Here's an even simpler repro, this time starting from a healthy state. Same Vagrantfile, same agent setup. After removing
After
Even after a

If I perform manual service deletes:

%> nomad service info -verbose myhttp
ID = _nomad-task-6200cf76-f53e-ed05-511d-e99f4b805c67-group-group-myhttp-http
Service Name = myhttp
Namespace = default
Job ID = http
Alloc ID = 6200cf76-f53e-ed05-511d-e99f4b805c67
Node ID = 9c8d38ff-89c2-e00e-9817-e3e632e637f1
Datacenter = dc1
Address = 127.0.0.1:30086
Tags = []
ID = _nomad-task-843ca52a-1445-c0ca-50e8-e2ffbc96d522-group-group-myhttp-http
Service Name = myhttp
Namespace = default
Job ID = http
Alloc ID = 843ca52a-1445-c0ca-50e8-e2ffbc96d522
Node ID = 9c8d38ff-89c2-e00e-9817-e3e632e637f1
Datacenter = dc1
Address = 127.0.0.1:23706
Tags = []
%> nomad service delete myhttp _nomad-task-843ca52a-1445-c0ca-50e8-e2ffbc96d522-group-group-myhttp-http
Successfully deleted service registration

The delete does seem to stick, unlike what we were seeing on the Consul side where they got re-created. Even after deleting both service entries, there's still constant log spam though of
Full log: |
Thanks again @Artanicus, I think we understand what happened now - this bug was originally fixed in 889c5aa0f7 and intended to be backported to

Unfortunately something in our pipeline lost/forgot/ignored the

If you could try to reproduce again using branch manual-backport-889c5aa0f7, it should hopefully just be fixed. |
The branch manual-backport-889c5aa0f7 doesn't seem to exist anymore, but since it seems like this made it into release/1.5.x, I'll try with that.
So it's at least a significant step in a good direction. With the Consul service provider, I imagine the transient failing services are registered but, since they're failing health checks, won't affect downstream load balancing. With built-in Nomad services it's a bit more iffy, since there seems to be no distinction between healthy and unhealthy services, but that I believe is a wholly separate bug/feature to debate and not in scope here. :-)

State snapshot for posterity:
Full agent log: |
Thanks again for the testing @Artanicus! We're planning to cut a bugfix release on Monday for this and a few other issues. |
Indeed, Nomad's built-in services actually do contain health check information, but the results of those health checks are currently only stored on the Client that executed the check and are used for passing/failing deployments, rather than LB availability. You can view their status per-allocation with |
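(If memory serves, the command being referred to is roughly the following; treat the exact subcommand as an assumption and check nomad alloc -h on your version.)

# show the Nomad-native service check results recorded for one allocation
# <alloc-id> is a placeholder
nomad alloc checks <alloc-id>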
It seems that v1.5.6 solves that. I am closing this issue. |
For some reason, the issue still exists for me after upgrading to 1.6.1 on the servers and 1.5.6 on the clients :( |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
The issue #16616 is marked as solved, but I have tested with Nomad 1.5.1 - 1.5.4. In all these versions the same issue is present.
Dead alloc with a different port than the previous alloc

Dead alloc with the same port as the previous alloc

If I run the script below, the dead allocs which are unhealthy disappear from Consul but are registered again by Nomad after a few seconds. |
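The script itself did not survive above; a minimal sketch of that kind of cleanup (not the original script), assuming Consul's HTTP API on localhost:8500 and jq installed, could look like:

#!/bin/bash
# Deregister every service whose health check is currently critical
# from the local Consul agent. Sketch only: it assumes the zombie
# entries were registered through this agent.
set -euo pipefail

CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"

curl -s "${CONSUL_ADDR}/v1/health/state/critical" \
  | jq -r '.[].ServiceID' \
  | sort -u \
  | while read -r svc_id; do
      [ -z "${svc_id}" ] && continue
      echo "Deregistering ${svc_id}"
      curl -s -X PUT "${CONSUL_ADDR}/v1/agent/service/deregister/${svc_id}"
    done

As described above, though, Nomad re-registers the entries a few seconds later, so this is only a temporary cleanup.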
EDIT: I stopped the job with nomad job stop --purge. Even then, Nomad registers all the dead allocs.
All allocs are gone after purging the job
Nomad still tries to check the dead alloc
Edit 2:
Restarting the nomad service on the worker after boot is a workaround, but definitely not for production. |