
Retry the cleanup of downloadAndExtract #24295

Closed

Conversation

sthornington
Contributor

This helps the HTTP downloader better cope with filesystems where unlink is non-atomic.

@github-actions github-actions bot added team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels Nov 12, 2024
@sthornington
Contributor Author

This fixes #23687 and maybe #20013. It would be wonderful if we could get this back-ported to 7.x as well, since it fixes a pretty big problem for us.

@sthornington sthornington changed the title retry the cleanup of downloadAndExtract Retry the cleanup of downloadAndExtract Nov 12, 2024
downloadDirectory.deleteTree();
// Retry the deleteTree for a while, if necessary, to cope with filesystems where unlinks are not atomic
Instant start = Instant.now();
Instant deadline = start.plus(Duration.ofSeconds(5));
Contributor Author

An easy objection is that 5 seconds is a long time, and I am not doing any sleeping or backoff in this loop. My reasoning is that previously this was an immediate abort of the entire build, so the time or CPU spent retrying something that was previously completely unhandled is a bit moot. In a few tests in our environment, 90% of these cases resolved in ~100ms, but occasionally the file deletes were not reflected for over 500ms, so I figured "better safe than sorry", and I didn't want to make this diff any more complicated than it already was.
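The retry described above can be sketched as follows. This is a minimal, self-contained illustration of the deadline-based loop with no sleep or backoff; `TreeDeleter` and `deleteTreeWithRetries` are hypothetical stand-ins for this sketch, not Bazel's actual Path API:

```java
import java.io.IOException;
import java.time.Duration;
import java.time.Instant;

public class DeleteTreeRetry {
  /** Hypothetical stand-in for the object whose deleteTree may transiently fail. */
  @FunctionalInterface
  interface TreeDeleter {
    void deleteTree() throws IOException;
  }

  /**
   * Keeps retrying deleteTree until it succeeds or a 5-second deadline passes.
   * There is deliberately no sleep or backoff: the alternative behavior was an
   * immediate abort of the whole build, so busy retrying is considered acceptable.
   */
  static void deleteTreeWithRetries(TreeDeleter dir) throws IOException {
    Instant deadline = Instant.now().plus(Duration.ofSeconds(5));
    while (true) {
      try {
        dir.deleteTree();
        return; // the tree is gone, we are done
      } catch (IOException e) {
        if (Instant.now().isAfter(deadline)) {
          throw e; // deadline exceeded: surface the last failure
        }
        // Otherwise fall through and retry immediately.
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulate an NFS-like delete that only becomes visible on the third try.
    int[] attempts = {0};
    deleteTreeWithRetries(() -> {
      if (++attempts[0] < 3) {
        throw new IOException("unlink not yet reflected");
      }
    });
    System.out.println("succeeded on attempt " + attempts[0]);
  }
}
```

On a healthy filesystem the first `deleteTree` call succeeds and the loop exits immediately, so the common path pays no extra cost.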

Collaborator

Could you give an example of such a filesystem? What exactly does it mean for unlink to not be atomic?

Contributor Author
@sthornington Nov 12, 2024

Some distributed NFS appliances have this property, and they ack the delete before all the caches are updated. The symptom is that you rm a file, but it may not be absent in an immediately following directory listing.

Collaborator

Another thought: Could we download to a temporary directory outside the repo that we can clean up asynchronously?

Contributor Author

That would work too. I was actually surprised the download was done into output_dir, but I figured it was due to some hermeticity argument.

Contributor Author

Is there anything I can do to move this along? It solves a massive problem for us, and I am sure for others who use similar filesystems.

Member

is deleteTree the only call that needs this special treatment?

either way, I think I'd prefer if you could extract this special handling into a separate method (deleteTreeWithRetries or something), and add some documentation explaining the rationale (preferably also linking to the GH issue). That way, it's less implementation burden for us.

Contributor Author

It's not really me using this function; it's the dependency downloading of http_archive and all the crate downloading of rules_rust. But if what you mean is that I should simply wrap the retries into a function that downloadAndExtract uses instead, I am happy to do that right now.

Contributor Author
@sthornington Nov 14, 2024

For now I am adding it in the StarlarkBaseExternalContext since Path.java seems pretty closely aligned with the file system calls...

Member

thanks, this looks good to me.

It's not really me using this function

I understand that -- it's just that having the extra logic specific to this call of deleteTree invites the question of "why isn't this done elsewhere". Extracting the logic into a separate method makes it easier to apply the same retry logic to other call sites.

@sthornington
Contributor Author

I actually tested this change in release-7.4.1 but I was not sure which branch was best to submit the PR on. It doesn't look like this section has changed between 7.4.1 and the head.

@sthornington sthornington force-pushed the simon_httpretry_onmaster branch from ed999d4 to 67822d1 Compare November 14, 2024 21:43
@sthornington sthornington force-pushed the simon_httpretry_onmaster branch from 67822d1 to 14f697f Compare November 14, 2024 22:03
@Wyverald Wyverald added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Nov 15, 2024
@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Nov 19, 2024
@sthornington sthornington deleted the simon_httpretry_onmaster branch December 2, 2024 19:43
@sthornington
Contributor Author

Is there anything I can do to get this into a 7.4.2?

@sthornington
Contributor Author


@Wyverald sorry to ping you - if I go through all the steps outlined in https://github.com/bazelbuild/continuous-integration/blob/master/docs/release-playbook.md, can I cherry-pick this to a 7.4.2 and release it? This checklist is so daunting that I assumed it was for Google employees.

@sthornington
Contributor Author

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Dec 10, 2024
@iancha1992
Member

@bazel-io fork 8.0.1

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Dec 10, 2024
bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Dec 10, 2024
This helps the HTTP downloader better cope with filesystems where unlink is non-atomic.

Closes bazelbuild#24295.

PiperOrigin-RevId: 697920434
Change-Id: I91b4dbf07a2efdca07c0310e15aac5f4d89c4091
github-merge-queue bot pushed a commit that referenced this pull request Dec 11, 2024
This helps the HTTP downloader better cope with filesystems where unlink
is non-atomic.

Closes #24295.

PiperOrigin-RevId: 697920434
Change-Id: I91b4dbf07a2efdca07c0310e15aac5f4d89c4091

Commit
99a27f6

Co-authored-by: Simon Thornington <[email protected]>
@iancha1992
Member

@bazel-io fork 7.5.0

bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Dec 12, 2024
This helps the HTTP downloader better cope with filesystems where unlink is non-atomic.

Closes bazelbuild#24295.

PiperOrigin-RevId: 697920434
Change-Id: I91b4dbf07a2efdca07c0310e15aac5f4d89c4091
iancha1992 pushed a commit that referenced this pull request Dec 16, 2024
This helps the HTTP downloader better cope with filesystems where unlink
is non-atomic.

Closes #24295.

PiperOrigin-RevId: 697920434
Change-Id: I91b4dbf07a2efdca07c0310e15aac5f4d89c4091

Commit
99a27f6

Co-authored-by: Simon Thornington <[email protected]>
ramil-bitrise pushed a commit to bitrise-io/bazel that referenced this pull request Dec 18, 2024
This helps the HTTP downloader better cope with filesystems where unlink is non-atomic.

Closes bazelbuild#24295.

PiperOrigin-RevId: 697920434
Change-Id: I91b4dbf07a2efdca07c0310e15aac5f4d89c4091
Labels
team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file.