This issue describes a problem we experienced yesterday on some store-gateways in a Mimir cluster.
Scenario:
Mimir configured to run the ring on Consul (but the issue could happen with any ring backend, memberlist included)
Store-gateway is overloaded while lazy-loading a large number of big blocks' index-headers
Store-gateway fails to heartbeat the ring for X consecutive minutes, where X > -store-gateway.sharding-ring.heartbeat-timeout (see the sketch after this list)
At the 1st next sync, the store-gateway detects itself as unhealthy and drops all its blocks
At the 2nd next sync, the store-gateway detects itself as healthy and re-adds all its blocks, causing additional load on the store-gateway itself
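For context, this is roughly how a ring heartbeat timeout works; the snippet below is a minimal sketch under that assumption, with an illustrative function name rather than Mimir's actual ring API:

```go
package example

import "time"

// isRingInstanceHealthy is an illustrative helper, not Mimir's actual API:
// the ring considers an instance healthy only while its last heartbeat is
// no older than the configured -store-gateway.sharding-ring.heartbeat-timeout.
func isRingInstanceHealthy(lastHeartbeat time.Time, heartbeatTimeout time.Duration, now time.Time) bool {
	return now.Sub(lastHeartbeat) <= heartbeatTimeout
}
```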
Zooming in on a specific store-gateway
The issue described above happened to several store-gateways in a large cluster.
To better understand the behaviour, I'm looking at a specific one: store-gateway-zone-b-3.
The sequence of related block syncs is:
level=info ts=2022-05-02T09:06:19.395846082Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:07:03.365919167Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
# In this sync all blocks have been dropped.
level=info ts=2022-05-02T09:07:03.3659588Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:08:55.148949647Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
# In this sync all blocks have been re-added. The sync took 37 minutes.
level=info ts=2022-05-02T09:08:55.148992704Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826814806Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826872774Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.853222892Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
Querying the metric cortex_consul_request_duration_seconds_count{namespace="REDACTED",pod="store-gateway-zone-b-3",operation="CAS",status_code="200"} we can see that the number of successful CAS operations did not increase between 09:03:15 and 09:08:30. Whenever scraping succeeded (the store-gateway was overloaded, so some scrapes failed), the metric value was always 2369 during that timeframe.
The store-gateway is running with -store-gateway.sharding-ring.heartbeat-timeout=4m and the last successful heartbeat was before 09:03:00, so by 09:07:00 its ring entry had expired and was considered unhealthy. The next sync started at 09:07:03, once the store-gateway was already unhealthy in the ring.
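As a quick sanity check of that arithmetic, here's a small self-contained Go snippet using the timestamps above (the exact last-heartbeat time is illustrative, only known to be just before 09:03:00):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative values taken from this incident:
	// heartbeat-timeout = 4m, last successful heartbeat just before 09:03:00 UTC.
	lastHeartbeat := time.Date(2022, 5, 2, 9, 2, 59, 0, time.UTC)
	heartbeatTimeout := 4 * time.Minute
	syncStart := time.Date(2022, 5, 2, 9, 7, 3, 0, time.UTC)

	deadline := lastHeartbeat.Add(heartbeatTimeout) // expires around 09:06:59
	fmt.Println(syncStart.After(deadline))          // true: the instance looks unhealthy when the sync starts
}
```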
The store-gateway has a protection to keep the previously loaded blocks in case it's unable to look up the ring. However, in this case it did successfully look up the ring, but the instance itself was detected as unhealthy (from mimir/pkg/storegateway/sharding_strategy.go, lines 98 to 110 at fb39490):
        level.Warn(logger).Log("msg", "failed to check block owner but block is kept because was previously loaded", "block", blockID.String(), "err", err)
    } else {
        level.Warn(logger).Log("msg", "failed to check block owner and block has been excluded because was not previously loaded", "block", blockID.String(), "err", err)

        // Skip the block.
        synced.WithLabelValues(shardExcludedMeta).Inc()
        delete(metas, blockID)
    }
    continue
}
The store-gateway also has a protection to keep the previously loaded blocks if there's no other authoritative owner for that block in the ring, but that wasn't the case either, because there were other owners (from mimir/pkg/storegateway/sharding_strategy.go, lines 117 to 132 at fb39490):
// The block is not owned by the store-gateway and there's at least 1 available
// authoritative owner available for queries, so we can filter it out (and unload
// it if it was loaded).
synced.WithLabelValues(shardExcludedMeta).Inc()
delete(metas, blockID)
So all the blocks were unloaded, and then (at the 2nd next sync) progressively loaded back.
Proposal
I propose adding a simple additional protection to the store-gateway: don't drop any loaded block if the store-gateway itself is unhealthy in the ring.
The idea is that if the store-gateway detects itself as unhealthy in the ring, it shouldn't do anything during the periodic sync, neither adding nor dropping blocks, because it's in an inconsistent situation: the store-gateway is obviously running, but it's detected as unhealthy in the ring.
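A minimal sketch of what this protection could look like, assuming the sync loop can check its own instance state in the ring before reconciling blocks; all names and types below are illustrative, not Mimir's actual implementation:

```go
package example

// InstanceState is an illustrative stand-in for the state of this
// store-gateway's own entry in the ring.
type InstanceState int

const (
	Healthy InstanceState = iota
	Unhealthy
)

// syncBlocks sketches the proposed protection: if the store-gateway's own ring
// entry is unhealthy, the periodic sync keeps the currently loaded blocks
// untouched instead of converging towards the ring's view of ownership.
func syncBlocks(ownState InstanceState, loadedBlocks, ringOwnedBlocks map[string]struct{}) map[string]struct{} {
	if ownState == Unhealthy {
		// Inconsistent situation: the process is obviously running, but the ring
		// says it's unhealthy (e.g. heartbeats timed out while overloaded).
		// Neither add nor drop blocks; just wait for the next sync.
		return loadedBlocks
	}
	// Normal case: load/unload blocks to match the ring's view of ownership.
	return ringOwnedBlocks
}
```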