
test_issue_5878 might fail when storage_controller's service processes reconcile_all() #10513

Open
alexanderlaw opened this issue Jan 27, 2025 · 0 comments
Labels: a/test/flaky/investigated, a/test/flaky (Area: related to flaky tests), t/bug (Issue Type: Bug)


The following failure of test_issue_5878:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10332/12938296645/index.html#/testresult/465f8982c4a069d5

test_issue_5878[debug-pg17-same_generation]
__arch: 'X64'
Duration: 42s 860ms

with the diagnostics:

test_runner/regress/test_layers_from_future.py:236: in test_issue_5878
    assert (
E   AssertionError: observed PUT overtake the stucked DELETE => bug isn't fixed yet
E   assert 1737668739.1487677 == 1737668744.7167115
E    +  where 1737668739.1487677 = os.stat_result(st_mode=33188, st_ino=125188923, st_dev=43, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4014080, st_atime=1737668739, st_mtime=1737668739, st_ctime=1737668739).st_mtime
E    +  and   1737668744.7167115 = os.stat_result(st_mode=33188, st_ino=125187964, st_dev=43, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4014080, st_atime=1737668744, st_mtime=1737668744, st_ctime=1737668744).st_mtime

which says "bug isn't fixed yet", in fact shows an instability of the test that occurs when the storage_controller happens to process reconcile_all() during the run.

pageserver.log contains:

2025-01-23T21:45:44.701937Z  INFO deletion frontend: Stored deletion list sequence=1
2025-01-23T21:45:44.703702Z  INFO deletion backend: Dropped remote consistent LSN updates for tenant 66b59df73444a1e9cc4dd5abd7ff2437 in stale generation 00000001
2025-01-23T21:45:44.703752Z  INFO deletion backend: Dropping stale deletions for tenant 66b59df73444a1e9cc4dd5abd7ff2437 in generation 00000001, objects may be leaked

However, it doesn't contain the line "at failpoint deletion-queue-before-execute-pause" that can be found in pageserver.log of a successful run, e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10332/12938296645/index.html#/testresult/bb66df12607db3eb:

2025-01-23T21:46:26.959315Z  INFO deletion frontend: Stored deletion list sequence=1
2025-01-23T21:46:26.963003Z  INFO deletion executor: at failpoint deletion-queue-before-execute-pause
2025-01-23T21:46:28.663305Z  INFO heatmap_upload_scheduler: Status: 0 tasks running, 0 pending


That is, the pageserver didn't reach the deletion-queue-before-execute-pause failpoint, because self.accumulator.is_empty() was true here:

    while !self.accumulator.is_empty() && !self.cancel.is_cancelled() {
        pausable_failpoint!("deletion-queue-before-execute-pause");
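
For illustration, here is a stand-alone sketch (not the actual neon code) of why an empty accumulator means that failpoint message never shows up in the log: the while guard is already false on the first check, so the loop body is never entered.

// Hypothetical, self-contained Rust illustration; drain() and the println are
// stand-ins for the Deleter loop and pausable_failpoint! quoted above.
fn drain(accumulator: &mut Vec<String>, cancelled: bool) {
    while !accumulator.is_empty() && !cancelled {
        // In the real code this is where the failpoint would fire before
        // executing the accumulated DELETEs.
        println!("at failpoint deletion-queue-before-execute-pause");
        accumulator.clear();
    }
}

fn main() {
    let mut empty: Vec<String> = Vec::new();
    drain(&mut empty, false); // prints nothing: the loop body is never reached
}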

Skipping a couple of intermediate steps, I see (with debug logging added) that Validator::validate gets no tenants in the storage_controller's response here:

        let tenants_valid = if let Some(controller_upcall_client) = &self.controller_upcall_client {
            match controller_upcall_client
                .validate(tenant_generations.iter().map(|(k, v)| (*k, *v)).collect())
                .await
            {
                Ok(tenants) => tenants,

Looking at storage_controller_1/storage_controller.log from the failed run, we can see:

2025-01-23T21:45:44.447194Z  INFO background_reconcile: Dropping detached tenant shard 66b59df73444a1e9cc4dd5abd7ff2437 from memory
2025-01-23T21:45:44.488095Z  INFO Heartbeat round complete for 1 nodes, 0 warming-up, 0 offline
...
2025-01-23T21:45:44.702921Z  INFO request{method=POST path=/upcall/v1/validate request_id=261bef16-9d02-48bd-b48a-8a12417f4804}: Handling request
2025-01-23T21:45:44.703094Z  INFO request{method=POST path=/upcall/v1/validate request_id=261bef16-9d02-48bd-b48a-8a12417f4804}: Refusing deletion validation for missing shard 66b59df73444a1e9cc4dd5abd7ff2437
2025-01-23T21:45:44.703190Z  INFO request{method=POST path=/upcall/v1/validate request_id=261bef16-9d02-48bd-b48a-8a12417f4804}: Request handled, status: 200 OK

That is, the storage controller decided to drop the tenant as a result of reconciliation and then refused to validate the deletion. If I understand correctly, the reconciliation is performed once every 20 seconds:

const BACKGROUND_RECONCILE_PERIOD: Duration = Duration::from_secs(20);

Note the duration of the failed test run: 42s 860ms, whereas a successful run (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10332/12938296645/index.html#/testresult/9e280b95a88ccde6) took 23s 535ms; of course, the duration varies from run to run.

With the interval decreased to 2 seconds, I got test_runner/regress/test_layers_from_future.py::test_issue_5878[debug-pg17-same_generation] failing on each run locally.
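
For reference, the change boils down to this one line (a sketch; I'm assuming the constant is defined in storage_controller/src/service.rs, the exact location may differ):

// Local-only tweak: make the background reconciliation fire often enough
// to race with the test's stuck deletion on (nearly) every run.
const BACKGROUND_RECONCILE_PERIOD: Duration = Duration::from_secs(2);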

alexanderlaw added the t/bug (Issue Type: Bug), a/test/flaky (Area: related to flaky tests), and a/test/flaky/investigated labels on Jan 27, 2025