tests: failures in test_timeline_archival_chaos #10389

Closed
5 of 6 tasks
jcsp opened this issue Jan 14, 2025 · 7 comments · Fixed by #10719
Labels: a/test (Area: related to testing), c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

Comments

@jcsp
Collaborator

jcsp commented Jan 14, 2025

Multiple failure modes:

jcsp added the a/test, c/storage/pageserver, and t/bug labels Jan 14, 2025
jcsp self-assigned this Jan 14, 2025
erikgrinaker added the triaged label Jan 21, 2025
@jcsp
Collaborator Author

jcsp commented Jan 27, 2025

Failure mode C: #10524

@jcsp
Collaborator Author

jcsp commented Jan 28, 2025

Failure mode D: #10532

@jcsp
Collaborator Author

jcsp commented Jan 30, 2025

Failure mode E: #10594

@jcsp
Collaborator Author

jcsp commented Jan 30, 2025

Failure mode B: #10595

github-merge-queue bot pushed a commit that referenced this issue Jan 30, 2025
## Problem

The test asserts that it completes at least 10 full timeline lifecycles,
but the noisy CI environment sometimes doesn't meet that goal.

Related: #10389

## Summary of changes

- Sleep for longer between pageserver restarts, so that the timeline
workers have more chance to make progress
- Sleep for shorter between retries from timeline worker, so that they
have better chance to get in while a pageserver is up between restarts
- Relax the success condition to complete at least 5 iterations instead
of 10
github-merge-queue bot pushed a commit that referenced this issue Jan 31, 2025
…10594)

## Problem

If offloading races with normal shutdown, we get a "failed to freeze and
flush: cannot flush frozen layers when flush_loop is not running, state
is Exited". This is harmless, but it shows that trying to freeze and
flush such a timeline is strange: flushing on shutdown for an archived
timeline isn't useful.

Related: #10389

## Summary of changes

- During Timeline::shutdown, ignore ShutdownMode::FreezeAndFlush if the
timeline is archived
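
A minimal sketch of the rule in that summary, not the actual pageserver code. Only the names `Timeline::shutdown` and `ShutdownMode::FreezeAndFlush` come from the commit message; the `Hard` variant, the `is_archived` flag, and the helper `effective_shutdown_mode` are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the pageserver's shutdown mode.
/// Only the `FreezeAndFlush` variant name appears in the commit message;
/// `Hard` is an assumed "stop without flushing" mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShutdownMode {
    FreezeAndFlush,
    Hard,
}

/// Decide which mode shutdown should actually use.
/// An archived timeline has nothing useful to flush, so a request to
/// freeze and flush is downgraded instead of racing with offloading.
fn effective_shutdown_mode(requested: ShutdownMode, is_archived: bool) -> ShutdownMode {
    match (requested, is_archived) {
        (ShutdownMode::FreezeAndFlush, true) => ShutdownMode::Hard,
        (mode, _) => mode,
    }
}
```

With this shape, an active timeline still gets its flush on shutdown, and only the archived case skips it.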
@jcsp
Collaborator Author

jcsp commented Feb 7, 2025

Failure mode F: #10702

@jcsp
Collaborator Author

jcsp commented Feb 7, 2025

Case A: the test is detecting that something has gone offloaded while it is really still offloading -- presumably the pageserver shows it in the list of offloaded things before the manifest is persistent.

In the general case, this is a test bug: the pageserver is allowed to un-offload things if it wants to; that is not illegal. However, what the test is really trying to verify is that things don't un-offload across restarts, so we don't want to drop the check entirely.

@jcsp
Collaborator Author

jcsp commented Feb 7, 2025

There's a TODO in the code relevant to case A:

    // Last step: mark timeline as offloaded in S3
    // TODO: maybe move this step above, right above deletion of the local timeline directory,
    // then there is no potential race condition where we partially offload a timeline, and
    // at the next restart attach it again.
    // For that to happen, we'd need to make the manifest reflect our *intended* state,
    // not our actual state of offloaded timelines.
    tenant.store_tenant_manifest().await?;

github-merge-queue bot pushed a commit that referenced this issue Feb 7, 2025
## Problem

There are a couple of log warnings tripping up
`test_timeline_archival_chaos`

- `stopping left-over name="timeline_delete" tenant_shard_id=2d526292b67dac0e6425266d7079c253 timeline_id=Some(44ba36bfdee5023672c93778985facd9) kind=TimelineDeletionWorker` (report: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10672/13161357302/index.html#/testresult/716b997bb1d8a021)
- `ignoring attempt to restart exited flush_loop 503d8f401d8887cfaae873040a6cc193/d5eed0673ba37d8992f7ec411363a7e3`

Related: #10389

## Summary of changes

- Downgrade the 'ignoring attempt to restart' to info -- there's nothing
in the design that forbids this happening, i.e. someone calling
maybe_spawn_flush_loop concurrently with shutdown()
- Prevent timeline deletion tasks outliving tenants by carrying a
gateguard. This logically makes sense because the deletion process does
call into Tenant to update manifests.
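
To illustrate the gate guard idea, here is a small blocking sketch built only on the standard library. It is not the pageserver's own gate type: `Gate`, `GateGuard`, `enter`, `close`, and `active_count` are hypothetical names chosen for this example. The point it shows is that closing waits until every outstanding guard (for example one held by a deletion task) has been dropped.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Shared state: whether the gate is closed, and how many guards are alive.
struct GateState {
    closed: bool,
    active: usize,
}

/// Hypothetical gate owned by the tenant in this sketch.
#[derive(Clone)]
struct Gate {
    inner: Arc<(Mutex<GateState>, Condvar)>,
}

/// Held by a background task for as long as it may touch the owner.
struct GateGuard {
    inner: Arc<(Mutex<GateState>, Condvar)>,
}

impl Gate {
    fn new() -> Self {
        let state = GateState { closed: false, active: 0 };
        Gate { inner: Arc::new((Mutex::new(state), Condvar::new())) }
    }

    /// Returns a guard, or `None` if shutdown has already started.
    fn enter(&self) -> Option<GateGuard> {
        let mut st = self.inner.0.lock().unwrap();
        if st.closed {
            return None;
        }
        st.active += 1;
        Some(GateGuard { inner: Arc::clone(&self.inner) })
    }

    /// Refuses new entries, then blocks until all guards are dropped.
    fn close(&self) {
        let mut st = self.inner.0.lock().unwrap();
        st.closed = true;
        while st.active > 0 {
            st = self.inner.1.wait(st).unwrap();
        }
    }

    /// Number of guards currently alive.
    fn active_count(&self) -> usize {
        self.inner.0.lock().unwrap().active
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        let mut st = self.inner.0.lock().unwrap();
        st.active -= 1;
        // Wake any thread waiting in `close`.
        self.inner.1.notify_all();
    }
}
```

A deletion task would call `enter()` before starting and keep the guard until it finishes. The owner's shutdown path calls `close()`, so it cannot complete while the task is still running.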
github-merge-queue bot pushed a commit that referenced this issue Feb 7, 2025
…10719)

## Problem

This test would sometimes fail its assertion that a timeline does not
revert to active once archived. That's because it was using the
in-memory offload state, not the persistent state, so this was sometimes
lost across a pageserver restart.

Closes: #10389

## Summary of changes

- When reading offload status, read from pageserver API _and_ remote
storage before considering the timeline offloaded
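
The check described in that summary can be sketched as follows. This is an illustration in Rust, not the test's actual code: the function `is_durably_offloaded` and its two input sets (timeline IDs the pageserver API reports as offloaded, and timeline IDs recorded as offloaded in remote storage) are assumptions for the example.

```rust
use std::collections::HashSet;

/// Treat a timeline as offloaded only when both sources agree:
/// the pageserver API reports it (in-memory state) and remote storage
/// records it (persistent state that survives a restart).
fn is_durably_offloaded(
    timeline_id: &str,
    api_offloaded: &HashSet<String>,
    remote_offloaded: &HashSet<String>,
) -> bool {
    api_offloaded.contains(timeline_id) && remote_offloaded.contains(timeline_id)
}
```

A timeline that appears only in the API listing is still mid-offload from the test's point of view, so seeing it active again after a restart would not count as a violation.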