tests: failures in `test_timeline_archival_chaos` #10389

jcsp · 2025-01-14T15:14:26Z

Multiple failure modes:

A) Assertion on timeline states: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10379/12768980436/index.html#testresult/1897a41a54810ddd/retries
B) Didn't make it through enough iterations: https://neon-github-public-dev.s3.amazonaws.com/reports/main/12748358128/index.html#testresult/327cbe4cb33ec42/retries
C) Log error for failure uploading manifest in compaction: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10244/12748633778/index.html#testresult/550fccea6ef552ad/retries
D) Failed to clean up uninitialized timeline directory: https://neon-github-public-dev.s3.amazonaws.com/reports/main/12966756686/index.html#testresult/5523f7d15f46f7f7/retries
E) `failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited`: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10558/13030683056/index.html#/testresult/916489f58539180
F) Leftover tasks warning: `WARN shutdown_pageserver{exit_code=0}: stopping left-over name="timeline_delete" tenant_shard_id=4fb531bf3b088b282c8a950ec97545e7 timeline_id=Some(cf17d70625d82ca45684b9bd7d2ce30f) kind=TimelineDeletionWorker`: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10608/13072224498/index.html#/testresult/ffcc90cce37c63d0

jcsp · 2025-01-27T17:50:38Z

Failure mode C: #10524

jcsp · 2025-01-28T10:44:59Z

Failure mode D: #10532

jcsp · 2025-01-30T18:14:20Z

Failure mode E: #10594

jcsp · 2025-01-30T18:29:07Z

Failure mode B: #10595

## Problem

The test asserts that it completes at least 10 full timeline lifecycles, but the noisy CI environment sometimes doesn't meet that goal.

Related: #10389

## Summary of changes

- Sleep for longer between pageserver restarts, so that the timeline workers have more chance to make progress
- Sleep for shorter between retries from timeline worker, so that they have better chance to get in while a pageserver is up between restarts
- Relax the success condition to complete at least 5 iterations instead of 10

…10594)

## Problem

If offloading races with normal shutdown, we get a "failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited". This is harmless, but it shows that trying to freeze and flush such a timeline is quite strange: flushing on shutdown for an archived timeline isn't useful.

Related: #10389

## Summary of changes

- During Timeline::shutdown, ignore ShutdownMode::FreezeAndFlush if the timeline is archived
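
The shutdown change for failure mode E can be sketched as a small decision function. This is a simplified model, not the pageserver's code: only `ShutdownMode::FreezeAndFlush` is named in the description, while the `Hard` variant, the `is_archived` flag, and the function name `should_freeze_and_flush` are assumptions made for illustration.

```rust
/// Simplified stand-in for the pageserver's shutdown mode.
/// `FreezeAndFlush` is named in the PR description; `Hard` is an
/// assumed "no flush" variant added for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ShutdownMode {
    FreezeAndFlush,
    Hard,
}

/// Decide whether shutdown should attempt a freeze and flush.
/// For an archived timeline the request is ignored, because flushing
/// it on shutdown isn't useful and can race with offloading.
pub fn should_freeze_and_flush(mode: ShutdownMode, is_archived: bool) -> bool {
    match mode {
        ShutdownMode::FreezeAndFlush => !is_archived,
        ShutdownMode::Hard => false,
    }
}
```

In this model, an archived timeline never reaches the flush path during shutdown, so it cannot hit the "flush_loop is not running" error there.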

jcsp · 2025-02-07T11:08:22Z

Failure mode F: #10702

jcsp · 2025-02-07T11:19:04Z

Case A: the test is detecting that something has gone offloaded while it is really still offload*ing* -- presumably the pageserver shows it in the list of offloaded things before the manifest has been persisted.

In the general case, this is a test bug: the pageserver is allowed to un-offload things if it wants to; that is not illegal. However, what the test is really trying to verify is that things don't un-offload across restarts, so we don't want to drop the check entirely.

jcsp · 2025-02-07T11:21:31Z

There's a TODO in the code relevant to case A:

    // Last step: mark timeline as offloaded in S3
    // TODO: maybe move this step above, right above deletion of the local timeline directory,
    // then there is no potential race condition where we partially offload a timeline, and
    // at the next restart attach it again.
    // For that to happen, we'd need to make the manifest reflect our *intended* state,
    // not our actual state of offloaded timelines.
    tenant.store_tenant_manifest().await?;
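
The race described in the TODO can be modelled with a toy sketch. It assumes that, after a restart, the only thing deciding whether a timeline is offloaded is whether the manifest was stored before the crash. The `Step` and `AfterRestart` types and `state_after_crash` are hypothetical names for this illustration, not pageserver APIs.

```rust
/// The two steps discussed in the TODO, in simplified form.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Step {
    DeleteLocalDir,
    StoreManifest,
}

/// What a restart would observe (under this sketch's assumption).
#[derive(Debug, PartialEq, Eq)]
pub enum AfterRestart {
    /// Manifest was not stored: the timeline is attached again.
    AttachedAgain,
    /// Manifest was stored: the timeline stays offloaded.
    StaysOffloaded,
}

/// Model a crash after the first `completed` steps of `order`.
pub fn state_after_crash(order: &[Step], completed: usize) -> AfterRestart {
    let manifest_stored = order
        .iter()
        .take(completed)
        .any(|step| *step == Step::StoreManifest);
    if manifest_stored {
        AfterRestart::StaysOffloaded
    } else {
        AfterRestart::AttachedAgain
    }
}
```

With the current order (delete the local directory, then store the manifest), a crash after the first step leaves a partially offloaded timeline that is attached again on restart. With the order the TODO proposes, the same crash leaves it offloaded.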

## Problem

There are a couple of log warnings tripping up `test_timeline_archival_chaos`:

- [`stopping left-over name="timeline_delete" tenant_shard_id=2d526292b67dac0e6425266d7079c253 timeline_id=Some(44ba36bfdee5023672c93778985facd9) kind=TimelineDeletionWorker`](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10672/13161357302/index.html#/testresult/716b997bb1d8a021)
- `ignoring attempt to restart exited flush_loop 503d8f401d8887cfaae873040a6cc193/d5eed0673ba37d8992f7ec411363a7e3`

Related: #10389

## Summary of changes

- Downgrade the 'ignoring attempt to restart' to info -- there's nothing in the design that forbids this happening, i.e. someone calling maybe_spawn_flush_loop concurrently with shutdown()
- Prevent timeline deletion tasks outliving tenants by carrying a gateguard. This logically makes sense because the deletion process does call into Tenant to update manifests.
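
The gate guard idea can be illustrated with a minimal, blocking sketch built only on the standard library. This is not the pageserver's gate implementation (which is async): the `Gate`, `GateGuard`, `enter`, and `close` names here are assumptions chosen for the example. The point it shows is that a task holding a guard prevents `close` from returning, so the owner cannot finish shutting down while the task is still running.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Shared bookkeeping: live guard count and whether close() has begun.
struct GateState {
    guards: usize,
    closing: bool,
}

/// Hypothetical blocking gate; cloneable handle to shared state.
#[derive(Clone)]
pub struct Gate {
    inner: Arc<(Mutex<GateState>, Condvar)>,
}

/// Held by a task for as long as it needs the gate's owner alive.
pub struct GateGuard {
    inner: Arc<(Mutex<GateState>, Condvar)>,
}

impl Gate {
    pub fn new() -> Self {
        let state = GateState { guards: 0, closing: false };
        Gate { inner: Arc::new((Mutex::new(state), Condvar::new())) }
    }

    /// Returns a guard, or None once close() has started.
    pub fn enter(&self) -> Option<GateGuard> {
        let mut state = self.inner.0.lock().unwrap();
        if state.closing {
            return None;
        }
        state.guards += 1;
        Some(GateGuard { inner: Arc::clone(&self.inner) })
    }

    /// Refuse new entrants, then block until every guard is dropped.
    pub fn close(&self) {
        let (lock, cvar) = &*self.inner;
        let mut state = lock.lock().unwrap();
        state.closing = true;
        while state.guards > 0 {
            state = cvar.wait(state).unwrap();
        }
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        let (lock, cvar) = &*self.inner;
        let mut state = lock.lock().unwrap();
        state.guards -= 1;
        if state.guards == 0 {
            // Wake any thread blocked in close().
            cvar.notify_all();
        }
    }
}
```

In this model, a deletion task would call `enter()` on the tenant's gate before starting and keep the guard until it finishes. Tenant shutdown would call `close()`, which waits for the task instead of leaving it behind as a left-over.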

…10719)

## Problem

This test would sometimes fail its assertion that a timeline does not revert to active once archived. That's because it was using the in-memory offload state, not the persistent state, so this was sometimes lost across a pageserver restart.

Closes: #10389

## Summary of changes

- When reading offload status, read from pageserver API _and_ remote storage before considering the timeline offloaded
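
The stricter offload check can be sketched as requiring agreement between two sources. The actual test is not shown in this thread, so the function below is only an illustration of the rule. `is_durably_offloaded`, `api_offloaded`, and `manifest_offloaded` are hypothetical names standing for the set reported by the pageserver API and the set recorded in the manifest in remote storage.

```rust
use std::collections::HashSet;

/// Treat a timeline as offloaded only if both the pageserver API
/// (in-memory view) and the remote manifest (persistent view) say so.
/// An API-only entry may still be lost across a pageserver restart.
pub fn is_durably_offloaded(
    timeline_id: &str,
    api_offloaded: &HashSet<String>,
    manifest_offloaded: &HashSet<String>,
) -> bool {
    api_offloaded.contains(timeline_id) && manifest_offloaded.contains(timeline_id)
}
```

Under this rule, a timeline that shows up as offloaded in the API but has not reached the manifest is still treated as "offloading", so its later reappearance as active after a restart does not trip the assertion.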

jcsp added the a/test (Area: related to testing), c/storage/pageserver (Component: storage: pageserver), and t/bug (Issue Type: Bug) labels Jan 14, 2025

jcsp self-assigned this Jan 14, 2025

erikgrinaker added the triaged (bugs that were already triaged) label Jan 21, 2025

jcsp mentioned this issue Jan 30, 2025

pageserver: exclude archived timelines from freeze+flush on shutdown #10594

Merged

jcsp mentioned this issue Jan 30, 2025

tests: relax constraints on test_timeline_archival_chaos #10595

Merged

jcsp mentioned this issue Feb 6, 2025

tests: address warnings in timeline shutdown #10702

Merged

jcsp mentioned this issue Feb 7, 2025

tests: wait for manifest persistence in test_timeline_archival_chaos #10719

Merged

jcsp closed this as completed in #10719 Feb 7, 2025
