tests: failures in test_timeline_archival_chaos
#10389
Failure mode C: #10524
Failure mode D: #10532
Failure mode E: #10594
Failure mode B: #10595
## Problem

The test asserts that it completes at least 10 full timeline lifecycles, but the noisy CI environment sometimes doesn't meet that goal.

Related: #10389

## Summary of changes

- Sleep for longer between pageserver restarts, so that the timeline workers have more chance to make progress
- Sleep for shorter between retries from timeline worker, so that they have better chance to get in while a pageserver is up between restarts
- Relax the success condition to complete at least 5 iterations instead of 10
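The timing change above can be pictured with a minimal sketch. It is not the test's actual code: `run_lifecycles`, `MIN_COMPLETED_LIFECYCLES`, the `ConnectionError` failure signal, and the sleep values are all assumptions made for illustration. A worker retries quickly while the pageserver is down, and the final check only requires the relaxed number of completed lifecycles.

```python
import time
from typing import Callable

# Hypothetical threshold mirroring the relaxed success condition (5 instead of 10).
MIN_COMPLETED_LIFECYCLES = 5


def run_lifecycles(
    attempt: Callable[[], None],
    deadline_s: float,
    retry_sleep_s: float = 0.1,
) -> int:
    """Run `attempt` repeatedly until the deadline; return how many succeeded.

    `attempt` stands in for one full timeline lifecycle. It is assumed to
    raise ConnectionError while the pageserver is restarting.
    """
    completed = 0
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        try:
            attempt()
        except ConnectionError:
            # A short retry sleep gives the worker a better chance of landing
            # in the window where the pageserver is up between restarts.
            time.sleep(retry_sleep_s)
            continue
        completed += 1
    return completed
```

A caller would then assert `run_lifecycles(...) >= MIN_COMPLETED_LIFECYCLES` at the end of the chaos run, rather than requiring 10.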
…10594)

## Problem

If offloading races with normal shutdown, we get a "failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited". This is harmless but points to it being quite strange to try and freeze and flush such a timeline. Flushing on shutdown for an archived timeline isn't useful.

Related: #10389

## Summary of changes

- During Timeline::shutdown, ignore ShutdownMode::FreezeAndFlush if the timeline is archived
Failure mode F: #10702
Case A: the test is detecting that something has gone offloaded while it is really offload_ing_ -- presumably the pageserver shows it in the list of offloaded things before that manifest is persistent. In the general case, this is a test bug: the pageserver is allowed to un-offload things if it wants to, it's not illegal. However, what the test is really trying to verify is that things don't un-offload across restarts, so we don't want to drop the check entirely.
There's a TODO in the code relevant to case A:
## Problem

There are a couple of log warnings tripping up `test_timeline_archival_chaos`:

- `[stopping left-over name="timeline_delete" tenant_shard_id=2d526292b67dac0e6425266d7079c253 timeline_id=Some(44ba36bfdee5023672c93778985facd9) kind=TimelineDeletionWorker\n')](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10672/13161357302/index.html#/testresult/716b997bb1d8a021)`
- `ignoring attempt to restart exited flush_loop 503d8f401d8887cfaae873040a6cc193/d5eed0673ba37d8992f7ec411363a7e3\n')`

Related: #10389

## Summary of changes

- Downgrade the 'ignoring attempt to restart' to info -- there's nothing in the design that forbids this happening, i.e. someone calling maybe_spawn_flush_loop concurrently with shutdown()
- Prevent timeline deletion tasks outliving tenants by carrying a gateguard. This logically makes sense because the deletion process does call into Tenant to update manifests.
…10719)

## Problem

This test would sometimes fail its assertion that a timeline does not revert to active once archived. That's because it was using the in-memory offload state, not the persistent state, so this was sometimes lost across a pageserver restart.

Closes: #10389

## Summary of changes

- When reading offload status, read from pageserver API _and_ remote storage before considering the timeline offloaded
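As a rough illustration of that last change, the check could look something like the sketch below. The function name and the two callables are hypothetical stand-ins, not the test's real helpers: one is assumed to return the offloaded timeline IDs reported by the pageserver API, the other the IDs recorded in the manifest in remote storage.

```python
from typing import Callable, Iterable


def is_durably_offloaded(
    timeline_id: str,
    list_offloaded_via_api: Callable[[], Iterable[str]],
    list_offloaded_in_manifest: Callable[[], Iterable[str]],
) -> bool:
    """Treat a timeline as offloaded only if both sources agree.

    Both callables are hypothetical: the first models the pageserver's
    in-memory view, the second the persistent state in remote storage.
    """
    in_memory = timeline_id in set(list_offloaded_via_api())
    persisted = timeline_id in set(list_offloaded_in_manifest())
    # The in-memory state alone may be lost across a restart, so it is not
    # enough to conclude the timeline must stay offloaded.
    return in_memory and persisted
```

Requiring both sources means a timeline that is still in the middle of offloading (listed by the API, manifest not yet persisted) is not yet held to the "must not revert to active" assertion.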
Multiple failure modes:
WARN shutdown_pageserver{exit_code=0}: stopping left-over name="timeline_delete" tenant_shard_id=4fb531bf3b088b282c8a950ec97545e7 timeline_id=Some(cf17d70625d82ca45684b9bd7d2ce30f) kind=TimelineDeletionWorker\n')
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10608/13072224498/index.html#/testresult/ffcc90cce37c63d0