
test_issue_5878 might fail when storage_controller's service processes reconcile_all() #10513

Open
alexanderlaw opened this issue Jan 27, 2025 · 0 comments
Labels: a/test/flaky/investigated, a/test/flaky (Area: related to flaky tests), t/bug (Issue Type: Bug)


The following failure of test_issue_5878:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10332/12938296645/index.html#/testresult/465f8982c4a069d5

test_issue_5878[debug-pg17-same_generation]
__arch: 'X64'
Duration: 42s 860ms

with the diagnostics:

test_runner/regress/test_layers_from_future.py:236: in test_issue_5878
    assert (
E   AssertionError: observed PUT overtake the stucked DELETE => bug isn't fixed yet
E   assert 1737668739.1487677 == 1737668744.7167115
E    +  where 1737668739.1487677 = os.stat_result(st_mode=33188, st_ino=125188923, st_dev=43, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4014080, st_atime=1737668739, st_mtime=1737668739, st_ctime=1737668739).st_mtime
E    +  and   1737668744.7167115 = os.stat_result(st_mode=33188, st_ino=125187964, st_dev=43, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4014080, st_atime=1737668744, st_mtime=1737668744, st_ctime=1737668744).st_mtime

which says "bug isn't fixed yet", in fact shows an instability of the test that occurs when the storage_controller happens to process reconcile_all() during the run.

pageserver.log contains:

2025-01-23T21:45:44.701937Z  INFO deletion frontend: Stored deletion list sequence=1
2025-01-23T21:45:44.703702Z  INFO deletion backend: Dropped remote consistent LSN updates for tenant 66b59df73444a1e9cc4dd5abd7ff2437 in stale generation 00000001
2025-01-23T21:45:44.703752Z  INFO deletion backend: Dropping stale deletions for tenant 66b59df73444a1e9cc4dd5abd7ff2437 in generation 00000001, objects may be leaked

However, it doesn't contain the line "at failpoint deletion-queue-before-execute-pause" that can be found in pageserver.log of a successful run, e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10332/12938296645/index.html#/testresult/bb66df12607db3eb:

2025-01-23T21:46:26.959315Z  INFO deletion frontend: Stored deletion list sequence=1
2025-01-23T21:46:26.963003Z  INFO deletion executor: at failpoint deletion-queue-before-execute-pause
2025-01-23T21:46:28.663305Z  INFO heatmap_upload_scheduler: Status: 0 tasks running, 0 pending


That is, the pageserver didn't reach the deletion-queue-before-execute-pause failpoint, because self.accumulator.is_empty() was true here:

    while !self.accumulator.is_empty() && !self.cancel.is_cancelled() {
        pausable_failpoint!("deletion-queue-before-execute-pause");
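
For illustration, here is a stand-alone sketch (not the actual neon code) of why an empty accumulator means that failpoint message never shows up in the log: the while guard is already false on the first check, so the loop body is never entered.

// Hypothetical, self-contained Rust illustration; drain() and the println are
// stand-ins for the Deleter loop and pausable_failpoint! quoted above.
fn drain(accumulator: &mut Vec<String>, cancelled: bool) {
    while !accumulator.is_empty() && !cancelled {
        // In the real code this is where the failpoint would fire before
        // executing the accumulated DELETEs.
        println!("at failpoint deletion-queue-before-execute-pause");
        accumulator.clear();
    }
}

fn main() {
    let mut empty: Vec<String> = Vec::new();
    drain(&mut empty, false); // prints nothing: the loop body is never reached
}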

Skipping a couple of intermediate steps, I see (with debug logging added) that Validator::validate gets no tenants in the storage_controller's response here:

        let tenants_valid = if let Some(controller_upcall_client) = &self.controller_upcall_client {
            match controller_upcall_client
                .validate(tenant_generations.iter().map(|(k, v)| (*k, *v)).collect())
                .await
            {
                Ok(tenants) => tenants,

Looking at storage_controller_1/storage_controller.log from the failed run, we can see:

2025-01-23T21:45:44.447194Z  INFO background_reconcile: Dropping detached tenant shard 66b59df73444a1e9cc4dd5abd7ff2437 from memory
2025-01-23T21:45:44.488095Z  INFO Heartbeat round complete for 1 nodes, 0 warming-up, 0 offline
...
2025-01-23T21:45:44.702921Z  INFO request{method=POST path=/upcall/v1/validate request_id=261bef16-9d02-48bd-b48a-8a12417f4804}: Handling request
2025-01-23T21:45:44.703094Z  INFO request{method=POST path=/upcall/v1/validate request_id=261bef16-9d02-48bd-b48a-8a12417f4804}: Refusing deletion validation for missing shard 66b59df73444a1e9cc4dd5abd7ff2437
2025-01-23T21:45:44.703190Z  INFO request{method=POST path=/upcall/v1/validate request_id=261bef16-9d02-48bd-b48a-8a12417f4804}: Request handled, status: 200 OK

That is, the storage controller decided to drop the tenant as a result of reconciliation and then refused to validate the deletion. If I understand correctly, the reconciliation is performed once every 20 seconds:

const BACKGROUND_RECONCILE_PERIOD: Duration = Duration::from_secs(20);

Note the duration of the failed test run: 42s 860ms, whereas a successful run (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10332/12938296645/index.html#/testresult/9e280b95a88ccde6) took 23s 535ms; of course, the duration varies from run to run.

With the interval decreased to 2 seconds, I got test_runner/regress/test_layers_from_future.py::test_issue_5878[debug-pg17-same_generation] failing on each run locally.
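
For reference, the change boils down to this one line (a sketch; I'm assuming the constant is defined in storage_controller/src/service.rs, the exact location may differ):

// Local-only tweak: make the background reconciliation fire often enough
// to race with the test's stuck deletion on (nearly) every run.
const BACKGROUND_RECONCILE_PERIOD: Duration = Duration::from_secs(2);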

alexanderlaw added the t/bug (Issue Type: Bug), a/test/flaky (Area: related to flaky tests), and a/test/flaky/investigated labels on Jan 27, 2025