This issue describes a problem we experienced yesterday on some store-gateways in a Mimir cluster.
Scenario:
Mimir configured to run the ring on Consul (but the issue could happen with any ring backend, memberlist included)
Store-gateway is overloaded while lazy-loading a large number of big blocks' index-headers
Store-gateway fails to heartbeat the ring for X consecutive minutes, where X > -store-gateway.sharding-ring.heartbeat-timeout (see the sketch after this list)
At the 1st next sync, the store-gateway detects itself as unhealthy and drops all its blocks
At the 2nd next sync, the store-gateway detects itself as healthy and re-adds all its blocks, causing additional load on the store-gateway itself
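For context, this is roughly how a ring heartbeat timeout works; the snippet below is a minimal sketch under that assumption, with an illustrative function name rather than Mimir's actual ring API:

```go
package example

import "time"

// isRingInstanceHealthy is an illustrative helper, not Mimir's actual API:
// the ring considers an instance healthy only while its last heartbeat is
// no older than the configured -store-gateway.sharding-ring.heartbeat-timeout.
func isRingInstanceHealthy(lastHeartbeat time.Time, heartbeatTimeout time.Duration, now time.Time) bool {
	return now.Sub(lastHeartbeat) <= heartbeatTimeout
}
```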
Zooming in on a specific store-gateway
The issue described above happened to several store-gateways in a large cluster.
To better understand the behaviour, I'm looking at a specific one: store-gateway-zone-b-3.
The sequence of related block syncs is:
level=info ts=2022-05-02T09:06:19.395846082Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:07:03.365919167Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
# In this sync all blocks have been dropped.
level=info ts=2022-05-02T09:07:03.3659588Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:08:55.148949647Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
# In this sync all blocks have been re-added. The sync took 37 minutes.
level=info ts=2022-05-02T09:08:55.148992704Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826814806Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826872774Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.853222892Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
Querying the metric cortex_consul_request_duration_seconds_count{namespace="REDACTED",pod="store-gateway-zone-b-3",operation="CAS",status_code="200"} we can see that the number of successful CAS operations did not increase between 09:03:15 and 09:08:30. Whenever scraping succeeded (the store-gateway was overloaded, so some scrapes failed), the metric value was always 2369 during that timeframe.
The store-gateway is running with -store-gateway.sharding-ring.heartbeat-timeout=4m and the last successful heartbeat was before 09:03:00, so by 09:07:00 its ring entry had expired and was considered unhealthy. The next sync started at 09:07:03, once the store-gateway was already unhealthy in the ring.
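As a quick sanity check of that arithmetic, here's a small self-contained Go snippet using the timestamps above (the exact last-heartbeat time is illustrative, only known to be just before 09:03:00):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative values taken from this incident:
	// heartbeat-timeout = 4m, last successful heartbeat just before 09:03:00 UTC.
	lastHeartbeat := time.Date(2022, 5, 2, 9, 2, 59, 0, time.UTC)
	heartbeatTimeout := 4 * time.Minute
	syncStart := time.Date(2022, 5, 2, 9, 7, 3, 0, time.UTC)

	deadline := lastHeartbeat.Add(heartbeatTimeout) // expires around 09:06:59
	fmt.Println(syncStart.After(deadline))          // true: the instance looks unhealthy when the sync starts
}
```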
The store-gateway has a protection to keep the previously loaded blocks in case it's unable to look up the ring. However, in this case it did successfully look up the ring, but the instance itself was detected as unhealthy (from mimir/pkg/storegateway/sharding_strategy.go, lines 98 to 110 at fb39490):
        level.Warn(logger).Log("msg", "failed to check block owner but block is kept because was previously loaded", "block", blockID.String(), "err", err)
    } else {
        level.Warn(logger).Log("msg", "failed to check block owner and block has been excluded because was not previously loaded", "block", blockID.String(), "err", err)

        // Skip the block.
        synced.WithLabelValues(shardExcludedMeta).Inc()
        delete(metas, blockID)
    }
    continue
}
The store-gateway also has a protection to keep the previously loaded blocks if there's no other authoritative owner for that block in the ring, but that wasn't the case either, because there were other owners (from mimir/pkg/storegateway/sharding_strategy.go, lines 117 to 132 at fb39490):
// The block is not owned by the store-gateway and there's at least 1 available
// authoritative owner available for queries, so we can filter it out (and unload
// it if it was loaded).
synced.WithLabelValues(shardExcludedMeta).Inc()
delete(metas, blockID)
So all the blocks were unloaded, and then (at the 2nd next sync) progressively loaded back.
Proposal
I propose adding a simple additional protection to the store-gateway: don't drop any loaded block if the store-gateway itself is unhealthy in the ring.
The idea is that if the store-gateway detects itself as unhealthy in the ring, it shouldn't do anything during the periodic sync, neither adding nor dropping blocks, because it's in an inconsistent situation: the store-gateway is obviously running, but it's detected as unhealthy in the ring.
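A minimal sketch of what this protection could look like, assuming the sync loop can check its own instance state in the ring before reconciling blocks; all names and types below are illustrative, not Mimir's actual implementation:

```go
package example

// InstanceState is an illustrative stand-in for the state of this
// store-gateway's own entry in the ring.
type InstanceState int

const (
	Healthy InstanceState = iota
	Unhealthy
)

// syncBlocks sketches the proposed protection: if the store-gateway's own ring
// entry is unhealthy, the periodic sync keeps the currently loaded blocks
// untouched instead of converging towards the ring's view of ownership.
func syncBlocks(ownState InstanceState, loadedBlocks, ringOwnedBlocks map[string]struct{}) map[string]struct{} {
	if ownState == Unhealthy {
		// Inconsistent situation: the process is obviously running, but the ring
		// says it's unhealthy (e.g. heartbeats timed out while overloaded).
		// Neither add nor drop blocks; just wait for the next sync.
		return loadedBlocks
	}
	// Normal case: load/unload blocks to match the ring's view of ownership.
	return ringOwnedBlocks
}
```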