
record restic backup progress in PodVolumeBackup #1821

Merged
merged 9 commits into vmware-tanzu:master from 1749-restic-progress on Sep 10, 2019

Conversation

@prydonius (Contributor):

This PR modifies the way we run restic backup to include reporting the progress of the backup.

Signed-off-by: Adnan Abdulhussein [email protected]

@prydonius (Contributor, Author):

Still WIP, but I wanted some feedback on the approach I'm taking here: a callback (updateFunc) that updates the progress from outside the controller. https://github.com/fusor/velero/pull/4/files implemented the RunBackup function in the controller package, which avoids the need for the callback, but I think the callback version will be more easily testable, albeit a bit more confusing.
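
A minimal sketch of the callback wiring, for illustration only (the controller method and the patchPodVolumeBackup helper are assumptions, not this PR's actual code):

// Sketch: the controller builds the callback and hands it to restic.RunBackup,
// so the restic package stays free of controller logic.
func (c *podVolumeBackupController) runBackup(
	req *velerov1api.PodVolumeBackup,
	backupCmd *restic.Command,
	log logrus.FieldLogger,
) error {
	updateFunc := func(progress velerov1api.PodVolumeOperationProgress) {
		// hypothetical patch helper; the real controller would persist the CR here
		req.Status.Progress = progress
		if err := c.patchPodVolumeBackup(req); err != nil {
			log.WithError(err).Error("error updating PodVolumeBackup progress")
		}
	}
	_, _, err := restic.RunBackup(backupCmd, log, updateFunc)
	return err
}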

@prydonius force-pushed the 1749-restic-progress branch from dcce611 to e8a6a99 on August 29, 2019 00:56

// RunBackup runs a `restic backup` command and watches the output to provide
// progress updates to the caller.
func RunBackup(backupCmd *Command, log logrus.FieldLogger, updateFunc func(velerov1api.PodVolumeOperationProgress)) (string, string, error) {
@prydonius (Contributor, Author) commented on the diff:

This implementation largely follows https://github.com/fusor/velero/pull/4/files. The main differences are:

  • we keep track of the number of bytes we've read, so we can skip lines we've already read when re-scanning the output
  • we quit the progress-checking goroutine as soon as the backup command finishes, so it doesn't have to wait for another cycle to discover that the backup is 100% complete (a ticker is used instead of a sleep so the goroutine can still react to the quit signal); the controller takes care of updating the progress to 100% once the backup has completed successfully (see the sketch below)
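
Roughly, the pattern looks like this (the interval constant and the latestProgress helper are placeholders, not this PR's exact code):

quit := make(chan struct{})
var wg sync.WaitGroup
wg.Add(1)
go func() {
	defer wg.Done()
	ticker := time.NewTicker(backupProgressCheckInterval)
	defer ticker.Stop()
	for {
		select {
		case <-quit:
			// backup command finished; stop polling immediately
			return
		case <-ticker.C:
			// hypothetical helper: scan only the output bytes we haven't
			// read yet and return the newest status line's progress, if any
			if progress, ok := latestProgress(); ok {
				updateFunc(progress)
			}
		}
	}
}()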

@prydonius (Contributor, Author):

One issue with this is that if a backup completes too quickly, the progress will never get set. This can easily happen for smaller volumes or if only a few files have changed in an incremental backup. I'm now thinking that this function should call updateFunc one last time with the final progress before returning, and we don't need to set the progress to 100% in the controller.
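
Something like this, as a sketch (lastProgress stands in for whatever was captured from the most recent status line):

cmd.Wait()
close(quit)
wg.Wait()

// emit one final update so backups that finish between ticks still record
// progress; the controller then no longer needs to force 100% itself
lastProgress.BytesDone = lastProgress.TotalBytes
updateFunc(lastProgress)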

@prydonius requested review from skriss, nrb and carlisia on August 29, 2019 20:55
@skriss (Contributor) left a comment:

approach LGTM - one question so far

@prydonius (Contributor, Author):

I've been trying a different approach here in order to address #1821 (comment), but I'm seeing something strange. The restic backup --json command finishes with a summary line, something like:

{"message_type":"summary","files_new":0,"files_changed":0,"files_unmodified":10,"dirs_new":0,"dirs_changed":0,"dirs_unmodified":0,"data_blobs":0,"tree_blobs":0,"data_added":0,"total_files_processed":10,"total_bytes_processed":21043871744,"total_duration":0.363935634,"snapshot_id":"5e764a0b"}

I'm trying to read this line after the command has finished, to extract total_bytes_processed and use it for a final update of the backup progress. This is consistently the last line when I run restic backup --json manually, both on my local machine and when exec'ing into the Pod. In the controller, however, it is sometimes the last line and other times the penultimate line (followed by another "status" line). Could this be because the last two lines are written to the buffer at around the same time, so their order can vary?

Might need some extra eyes on this next week @skriss.
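
A sketch of the decode being attempted (the struct and helper are illustrative, with only the fields needed; the JSON keys match the summary line above):

// Sketch: decode the summary line and emit the final progress update.
type backupSummary struct {
	MessageType         string `json:"message_type"`
	TotalBytesProcessed int64  `json:"total_bytes_processed"`
	SnapshotID          string `json:"snapshot_id"`
}

func updateFromSummary(line string, updateFunc func(velerov1api.PodVolumeOperationProgress)) error {
	var summary backupSummary
	if err := json.Unmarshal([]byte(line), &summary); err != nil {
		return err // the line we grabbed wasn't the JSON summary after all
	}
	updateFunc(velerov1api.PodVolumeOperationProgress{
		TotalBytes: summary.TotalBytesProcessed,
		BytesDone:  summary.TotalBytesProcessed,
	})
	return nil
}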

@skriss (Contributor) commented on Sep 3, 2019:

Just ran a restic backup cmd locally using restic v0.9.5 - got the following output at the end:

{"message_type":"status","seconds_elapsed":9,"percent_done":1,"total_files":428,"files_done":428,"total_bytes":2168012,"bytes_done":2168012}
{"message_type":"status","seconds_elapsed":10,"percent_done":1,"total_files":428,"files_done":428,"total_bytes":2168012,"bytes_done":2168012}
{"message_type":"status","seconds_elapsed":11,"percent_done":1,"total_files":428,"files_done":428,"total_bytes":2168012,"bytes_done":2168012}
{"message_type":"summary","files_new":428,"files_changed":0,"files_unmodified":0,"dirs_new":0,"dirs_changed":0,"dirs_unmodified":0,"data_blobs":428,"tree_blobs":1,"data_added":2168388,"total_files_processed":428,"total_bytes_processed":2168012,"total_duration":11.171735739,"snapshot_id":"4e9479ef"}
{"message_type":"summary","files_new":428,"files_changed":0,"files_unmodified":0,"dirs_new":0,"dirs_changed":0,"dirs_unmodified":0,"data_blobs":428,"tree_blobs":1,"data_added":2168388,"total_files_processed":428,"total_bytes_processed":2168012,"total_duration":11.171735739,"snapshot_id":"4e9479ef"}
{"message_type":"status","seconds_elapsed":11,"percent_done":1,"total_files":428,"files_done":428,"total_bytes":2168012,"bytes_done":2168012}

So duplicate "summary" lines, and a "status" line after the "summary" ones. Seems like there are probably some bugs in restic's status reporter.

@prydonius (Contributor, Author):

Okay, I've filed restic/restic#2389 upstream to see if we can get help with this issue. In the meantime, we can work around it by looking for the summary line instead of just taking the last line.
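
A sketch of that workaround (the function name is illustrative):

import (
	"bufio"
	"encoding/json"
	"errors"
	"strings"
)

// getSummaryLine scans every line of restic's JSON output and returns the one
// whose message_type is "summary", rather than assuming it comes last.
func getSummaryLine(output string) (string, error) {
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := scanner.Text()
		var msg struct {
			MessageType string `json:"message_type"`
		}
		if err := json.Unmarshal([]byte(line), &msg); err != nil {
			continue // skip any non-JSON noise
		}
		if msg.MessageType == "summary" {
			return line, nil
		}
	}
	return "", errors.New("summary line not found in restic output")
}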

@prydonius force-pushed the 1749-restic-progress branch from e8a6a99 to 8e6cc51 on September 3, 2019 23:07
@prydonius marked this pull request as ready for review on September 3, 2019 23:58
@prydonius (Contributor, Author):

This is now ready for review; I'll work on the restore progress and CLI changes in a separate PR.

@prydonius (Contributor, Author) left a comment:

Added display of the backup progress when running velero backup describe --details:

(demo GIF: velero-backup-progress-demo)

@@ -188,7 +188,8 @@ func describePodVolumeRestores(d *Describer, restores []v1.PodVolumeRestore, det
	restoresByPod := new(volumesByPod)

	for _, restore := range restoresByPhase[phase] {
		restoresByPod.Add(restore.Spec.Pod.Namespace, restore.Spec.Pod.Name, restore.Spec.Volume)
		// TODO(adnan): replace last parameter with progress from status (#1749)
@prydonius (Contributor, Author) commented on the diff:

will address this in the PR for restore progress

@skriss (Contributor) commented on Sep 4, 2019:

Looks good on a quick pass - I'll do some testing and another pass later on today.

Adnan Abdulhussein added 2 commits September 4, 2019 13:31
Signed-off-by: Adnan Abdulhussein <[email protected]>

cmd.Wait()  // wait for the restic backup command to exit
close(quit) // signal the progress goroutine to stop
wg.Wait()   // wait for the progress goroutine to finish
@skriss (Contributor) commented on the diff:

I don't think we need this wait group, do we? I think the "quit" logic could be simplified - after L120, call ticker.Stop() (you'll need to move ticker's declaration to outside the goroutine), then close(quit). The select case that receives from quit can just return.

@prydonius (Contributor, Author):

Yeah, it's mostly left over from the additional goroutines we had for copying stdout/stderr. One issue is that if we're in the middle of the ticker.C case and it takes some time to complete, we might call updateFunc in the goroutine with BytesDone < TotalBytes after the final call to updateFunc at the end of this function. If we wait for the goroutine to finish, we know that definitely won't happen.

That said, the work in the ticker.C case should be pretty fast for backup; for restore I think we'll need the waitgroup, since calculating the size of the volume can take an indeterminate amount of time.

@skriss (Contributor):

OK, that's a good point. I guess the other way you could do this is to send on the quit channel rather than closing it, since a send would block until the goroutine receives it, but I'm fine with this as-is.
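
To illustrate the difference (sketch only):

// an unbuffered send blocks until the goroutine actually receives from quit,
// which also guarantees any in-flight ticker.C case has already finished
quit <- struct{}{}

// versus: close returns immediately, so it needs wg.Wait() to know the
// goroutine is done
close(quit)
wg.Wait()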

@prydonius (Contributor, Author):

@skriss ah good point, I didn't think of that. I like the idea of removing the unnecessary waitgroup, so will update this to do that - thanks!

@skriss (Contributor) left a comment:

LGTM. @nrb @carlisia PTAL!

Signed-off-by: Adnan Abdulhussein <[email protected]>
Signed-off-by: Adnan Abdulhussein <[email protected]>
Signed-off-by: Adnan Abdulhussein <[email protected]>
@prydonius (Contributor, Author):

addressed all the review comments here - PTAL @nrb @carlisia!

@carlisia (Contributor) left a comment:

👍 great comments, and tests.

@carlisia merged commit 1e182e5 into vmware-tanzu:master on Sep 10, 2019
@prydonius deleted the 1749-restic-progress branch on September 10, 2019 18:38