The following failure of test_issue_5878:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10332/12938296645/index.html#/testresult/465f8982c4a069d5
test_issue_5878[debug-pg17-same_generation]
__arch: 'X64'
Duration: 42s 860ms
with the diagnostics:
test_runner/regress/test_layers_from_future.py:236: in test_issue_5878
assert (
E AssertionError: observed PUT overtake the stucked DELETE => bug isn't fixed yet
E assert 1737668739.1487677 == 1737668744.7167115
E + where 1737668739.1487677 = os.stat_result(st_mode=33188, st_ino=125188923, st_dev=43, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4014080, st_atime=1737668739, st_mtime=1737668739, st_ctime=1737668739).st_mtime
E + and 1737668744.7167115 = os.stat_result(st_mode=33188, st_ino=125187964, st_dev=43, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4014080, st_atime=1737668744, st_mtime=1737668744, st_ctime=1737668744).st_mtime
that says "bug isn't fixed yet", in fact shows an instability of the test in case of processing reconcile_all() by storage_controller.
pageserver.log contains:
2025-01-23T21:45:44.701937Z INFO deletion frontend: Stored deletion list sequence=1
2025-01-23T21:45:44.703702Z INFO deletion backend: Dropped remote consistent LSN updates for tenant 66b59df73444a1e9cc4dd5abd7ff2437 in stale generation 00000001
2025-01-23T21:45:44.703752Z INFO deletion backend: Dropping stale deletions for tenant 66b59df73444a1e9cc4dd5abd7ff2437 in generation 00000001, objects may be leaked
(but doesn't contain a line "at failpoint deletion-queue-before-execute-pause" that we can find in pageserver.log of a successful run, e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10332/12938296645/index.html#/testresult/bb66df12607db3eb:
2025-01-23T21:46:26.959315Z INFO deletion frontend: Stored deletion list sequence=1
2025-01-23T21:46:26.963003Z INFO deletion executor: at failpoint deletion-queue-before-execute-pause
2025-01-23T21:46:28.663305Z INFO heatmap_upload_scheduler: Status: 0 tasks running, 0 pending
)
That is, the pageserver didn't reach the deletion-queue-before-execute-pause failpoint because of self.accumulator.is_empty() here:
neon/pageserver/src/deletion_queue/deleter.rs
Lines 92 to 93 in be718ed
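For context, here is a minimal self-contained sketch of that kind of early-return guard (the struct and method names are assumptions for illustration, not the actual deleter.rs code):

```rust
// Sketch only, not the actual neon deleter.rs: a deletion executor that
// accumulates object keys and deletes them in one batch. When the
// accumulator is empty, flush() returns early, so the
// "deletion-queue-before-execute-pause" failpoint placed on the execution
// path below the guard is never reached.
struct Deleter {
    // keys pending deletion; the element type here is an assumption
    accumulator: Vec<String>,
}

impl Deleter {
    fn flush(&mut self) {
        if self.accumulator.is_empty() {
            return; // nothing accumulated => no execution, no failpoint hit
        }
        // ... the failpoint and the remote DELETE batch would go here ...
        self.accumulator.clear();
    }
}

fn main() {
    let mut deleter = Deleter { accumulator: Vec::new() };
    deleter.flush(); // early return: the code below the guard never runs
}
```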
Skipping a couple of intermediate steps, I see (with debug logging added) that Validator::validate gets no tenants in the storage_controller's response here:
neon/pageserver/src/deletion_queue/validator.rs
Lines 148 to 153 in be718ed
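To illustrate why an empty validation response leads there, a hedged sketch follows (the types, names, and response shape are assumptions, not the actual validator.rs code): a tenant absent from the controller's response cannot be validated, so its deletions never reach the executor.

```rust
use std::collections::HashMap;

type TenantId = String;

// Hedged sketch, not the actual validator.rs: the validator asks the storage
// controller which tenants' generations are still current. A tenant missing
// from the response cannot be confirmed, so its pending deletions are dropped
// instead of being handed to the executor -- which is why the executor's
// accumulator stays empty in the failed run.
fn validated_keys(
    pending: Vec<(TenantId, Vec<String>)>,         // tenant -> layer keys queued for deletion
    controller_response: &HashMap<TenantId, bool>, // tenant -> "generation is still valid"
) -> Vec<String> {
    let mut executable = Vec::new();
    for (tenant, keys) in pending {
        match controller_response.get(&tenant) {
            // validated: these keys go on to the executor
            Some(true) => executable.extend(keys),
            // stale generation, or tenant missing from the response: the
            // deletions are dropped ("objects may be leaked" in the
            // pageserver log above)
            _ => {}
        }
    }
    executable
}

fn main() {
    let pending = vec![(
        "66b59df73444a1e9cc4dd5abd7ff2437".to_string(),
        vec!["layer_key".to_string()],
    )];
    // the controller already dropped the shard, so it returns no entry for it
    let response: HashMap<TenantId, bool> = HashMap::new();
    assert!(validated_keys(pending, &response).is_empty());
}
```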
Looking at storage_controller_1/storage_controller.log from the failed run, we can see:
2025-01-23T21:45:44.447194Z INFO background_reconcile: Dropping detached tenant shard 66b59df73444a1e9cc4dd5abd7ff2437 from memory
2025-01-23T21:45:44.488095Z INFO Heartbeat round complete for 1 nodes, 0 warming-up, 0 offline
...
2025-01-23T21:45:44.702921Z INFO request{method=POST path=/upcall/v1/validate request_id=261bef16-9d02-48bd-b48a-8a12417f4804}: Handling request
2025-01-23T21:45:44.703094Z INFO request{method=POST path=/upcall/v1/validate request_id=261bef16-9d02-48bd-b48a-8a12417f4804}: Refusing deletion validation for missing shard 66b59df73444a1e9cc4dd5abd7ff2437
2025-01-23T21:45:44.703190Z INFO request{method=POST path=/upcall/v1/validate request_id=261bef16-9d02-48bd-b48a-8a12417f4804}: Request handled, status: 200 OK
That is, the storage controller decided to drop the tenant as a result of reconciliation and then refused validation of the deletion. And if I understand correctly, the reconciliation is performed once every 20 seconds:
neon/storage_controller/src/service.rs
Line 952 in be718ed
Note the duration of the failed test run: 42s 860ms, though for a successful run (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10332/12938296645/index.html#/testresult/9e280b95a88ccde6) we have 23s 535ms, but it changes from run to run, of course.
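To make the timing explicit, here is a minimal sketch of such a periodic background pass (the constant name and loop structure are assumptions; only the 20-second cadence and reconcile_all() come from the source):

```rust
use std::{thread, time::Duration};

// Timing sketch only, not the actual service.rs: a background task wakes up
// on a fixed 20-second interval and runs reconcile_all(), which can detach
// and drop tenant shards from memory. Whether such a pass lands between the
// test's DELETE being queued and its validation depends entirely on where
// the test run falls inside this window.
const BACKGROUND_RECONCILE_INTERVAL: Duration = Duration::from_secs(20);

fn reconcile_all() {
    // placeholder: the real storage controller reconciles shard placement
    // here, possibly dropping detached shards (see the log line above)
}

fn main() {
    loop {
        thread::sleep(BACKGROUND_RECONCILE_INTERVAL);
        reconcile_all();
    }
}
```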
With the interval decreased to 2 seconds, I got test_runner/regress/test_layers_from_future.py::test_issue_5878[debug-pg17-same_generation] failing on each run locally.