Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gsutil upload retry helper function #6817

Merged
merged 3 commits into from
Jan 25, 2024
Merged

Add gsutil upload retry helper function #6817

merged 3 commits into from
Jan 25, 2024

Conversation

kthui
Copy link
Contributor

@kthui kthui commented Jan 19, 2024

See the ticket for more details.

@kthui kthui changed the title Add gsutil cp retry helper function Add gsutil upload retry helper function Jan 19, 2024
@kthui kthui requested review from GuanLuo and mc-nv January 19, 2024 22:42
@kthui kthui marked this pull request as ready for review January 19, 2024 22:58
@kthui kthui requested a review from rmccorm4 January 19, 2024 22:59
rm -f gcs_upload.gsutil_cp.log
until gsutil -m cp -c -L gcs_upload.gsutil_cp.log -r $local_path $gcs_path; do
sleep 1
done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add max retry count?

Copy link
Contributor Author

@kthui kthui Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, because if it fails to upload after max retry, there is nothing to do but fail the test. In that case, it would be better to keep retrying until it timed out, which gives the best chances for the test to pass.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the timeout? If you are referring to the CI timeout, that will be too long, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, adding max retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If fails:

+ gcs_upload models/ gs://<hidden>
+ local local_path=models/
+ local gcs_path=gs://<hidden>
+ local retry_left=30
+ local log_file=gcs_upload.gsutil_cp.log
+ rm -f gcs_upload.gsutil_cp.log
+ touch gcs_upload.gsutil_cp.log
+ gsutil -m cp -c -L gcs_upload.gsutil_cp.log -r models/ gs://<hidden>
Copying file://models/dummy [Content-Type=application/octet-stream]...
Error copying file://models/: NotFoundException: 404 The destination bucket gs://<hidden> does not exist or the write to the destination must be restarted
CommandException: 1 file/object could not be transferred.
+ sleep 1
+ (( retry_left-- ))
+ [[ 29 -eq 0 ]]
+ gsutil -m cp -c -L gcs_upload.gsutil_cp.log -r models/ gs://<hidden>
Copying file://models/dummy [Content-Type=application/octet-stream]...
Error copying file://models/: NotFoundException: 404 The destination bucket gs://<hidden> does not exist or the write to the destination must be restarted
CommandException: 1 file/object could not be transferred.
+ sleep 1
+ (( retry_left-- ))
+ [[ 28 -eq 0 ]]
...
+ [[ 0 -eq 0 ]]
+ cat gcs_upload.gsutil_cp.log
... <hidden> ...
+ echo -e '=== Failed upload to GCS bucket'
=== Failed upload to GCS bucket
+ return 121

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If success:

+ gcs_upload models/ gs://<hidden>
+ local local_path=models/
+ local gcs_path=gs://<hidden>
+ local retry_left=30
+ local log_file=gcs_upload.gsutil_cp.log
+ rm -f gcs_upload.gsutil_cp.log
+ touch gcs_upload.gsutil_cp.log
+ gsutil -m cp -c -L gcs_upload.gsutil_cp.log -r models/ gs://<hidden>
Copying file://models/dummy [Content-Type=application/octet-stream]...
/ [0/1 files][    0.0 B/    0.0 B]                                              
/ [1/1 files][    0.0 B/    0.0 B]                                              
Operation completed over 1 objects.
...

@kthui kthui requested a review from GuanLuo January 19, 2024 23:06
local local_path=$1
local gcs_path=$2
rm -f gcs_upload.gsutil_cp.log
until gsutil -m cp -c -L gcs_upload.gsutil_cp.log -r $local_path $gcs_path; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that -m flag by default includes -c flag.
The issue with it that it will keep return non-zero code if error happen.

-c
If an error occurs, continue attempting to copy the remaining files. If any copies are unsuccessful, gsutil's exit status is non-zero, even if this flag is set. This option is implicitly set when running gsutil -m cp....
https://cloud.google.com/storage/docs/gsutil/commands/cp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we should drop this flag in favor of below?
gsutil cp -L gcs_upload.gsutil_cp.log -r $local_path $gcs_path

Copy link
Contributor Author

@kthui kthui Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the -c flag is exactly what we want, because the goal is to capture the error in a loop until it success. (or until a max retry which I am looking into that)

@kthui kthui requested a review from mc-nv January 20, 2024 01:03
@kthui
Copy link
Contributor Author

kthui commented Jan 22, 2024

@GuanLuo @mc-nv How about we keep it simple and just use sequential upload? It's because performance is not the most important for those tests and sequential upload should keep the stress on the network low. We can add retry in the future if that does not help.

@kthui kthui merged commit 4bc15c9 into main Jan 25, 2024
3 checks passed
@kthui kthui deleted the jacky-gsutil-retry branch January 25, 2024 23:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants