Backup progress #20
Idea: store per-backup log file to object storage. Add
Note: backup logs are separate from progress. We'll be coming up with ways to track real-time progress, as described above, to close out this issue.
For backups, we process first by resource then by namespace. I think for progress reporting, we should not be tightly coupled to the current mode/order of processing since that could change. We can count the number of resource-namespace combinations and then report as each pair gets completed. My initial thought is something like:
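The original sketch was not captured in this thread. Purely as an illustration of "count the pairs, report as each completes", here is a minimal Go sketch; the `pair` and `progress` types and all field names are hypothetical, not part of the proposed design:

```go
package main

import "fmt"

// pair identifies one resource/namespace combination to back up (hypothetical type).
type pair struct {
	Resource  string
	Namespace string
}

// progress tracks completed pairs out of the total, i.e. "x of y".
type progress struct {
	Completed int
	Total     int
}

func main() {
	pairs := []pair{
		{"pods", "default"},
		{"pods", "kube-system"},
		{"deployments.apps", "default"},
	}

	p := progress{Total: len(pairs)}
	for _, pr := range pairs {
		// ... back up all items for this resource/namespace pair ...
		p.Completed++
		fmt.Printf("backed up %s in %s (%d of %d)\n", pr.Resource, pr.Namespace, p.Completed, p.Total)
	}
}
```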
cc @ncdc
@ncdc let me know if you have any thoughts on this. We may have to wait until we decide on a revised backup/restore design before finalizing the implementation plan.
@skriss it would be nice if operationProgress.percentComplete were more accurate, based on the percentages of the individual ItemProgresses. When we do a backup, we know up front how many different types of resources we have, and how many items we have per resource type. If we store that information in backup.status, we can use it when restoring. Doing it that way, we could simply use % complete per resource type, and not worry about namespace.
We could store a map of GroupResource string to:

```go
type ResourceStatus struct {
	Processed int
	Total     int
}
```

It could look like this:

```yaml
resourceProgress:
  pods:
    processed: 15
    total: 100
  storageclasses.storage.k8s.io:
    processed: 0
    total: 3
```
And if we need to precalculate and store percentages, we could, although it's probably easy enough not to and just let consumers do it. Also, if we ever move to a work queue and we want to have multiple workers independently updating progress, we could get a lot of conflicts and retries trying to update a single map in a single Backup. Maybe JSON Patch would help there...
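As a rough illustration of leaving the math to consumers, a small Go sketch that derives an overall "x of y" (and, optionally, a percentage) from such a map; the ResourceStatus type mirrors the struct above, and everything else is hypothetical:

```go
package main

import "fmt"

// ResourceStatus mirrors the struct proposed above.
type ResourceStatus struct {
	Processed int
	Total     int
}

func main() {
	resourceProgress := map[string]ResourceStatus{
		"pods":                          {Processed: 15, Total: 100},
		"storageclasses.storage.k8s.io": {Processed: 0, Total: 3},
	}

	// Consumers can sum the stored counts and derive any percentage they want.
	var processed, total int
	for _, rs := range resourceProgress {
		processed += rs.Processed
		total += rs.Total
	}

	fmt.Printf("%d of %d items processed", processed, total)
	if total > 0 {
		fmt.Printf(" (%.0f%%)", 100*float64(processed)/float64(total))
	}
	fmt.Println()
}
```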
- I'd be fine just doing it per-resource rather than also by namespace.
- How do we know the total # of items per resource type? We list/back them up per namespace.
- I like this idea in theory, but need to think about how it interacts with restore includes/excludes and label selectors.
- WDYT about having a single goroutine responsible for updating progress, with the workers just reporting to that goroutine? (A sketch follows below.)
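A minimal sketch of that single-updater idea, assuming workers send increments over a channel and one goroutine owns all writes to the shared status; names here are illustrative, not Velero APIs:

```go
package main

import (
	"fmt"
	"sync"
)

// update is one worker's report: a resource name and how many items it finished.
type update struct {
	Resource string
	Done     int
}

func main() {
	updates := make(chan update)
	var workers sync.WaitGroup

	// Workers only send reports; they never touch the shared progress state.
	for _, r := range []string{"pods", "secrets", "configmaps"} {
		workers.Add(1)
		go func(resource string) {
			defer workers.Done()
			for i := 0; i < 3; i++ {
				// ... back up one item ...
				updates <- update{Resource: resource, Done: 1}
			}
		}(r)
	}

	go func() {
		workers.Wait()
		close(updates)
	}()

	// A single goroutine (here, main) applies every update, so there are no
	// concurrent writes and no conflict/retry storms from competing patches.
	progress := map[string]int{}
	for u := range updates {
		progress[u.Resource] += u.Done
		fmt.Printf("progress: %v\n", progress)
	}
}
```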
Regardless of the implementation approach chosen, I would strongly advise against using percentages to measure progress. I would recommend using "x of y" instead.
Yeah, we won't, as I wrote above.
cc @jbeda - another UX question
I think the status of a backup comes down to a set of questions:
The problem with percentages is that it is hard to answer these if the backup will take a long time. If it takes 5 minutes to move one percent, then you have to wait 5-10 minutes to get an idea. With that in mind, having the raw data helps. It doesn't have to be super accurate -- something like "tasks", or some counter that moves regularly.
My initial impression here is that we'd store the total in the backup, but we may have to recalculate it when doing a selective restore. So a backup could hold 10 items, but we only want 6 in this restore. That does mean we're duplicating the logic, but currently I think that's not terrible.
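As a sketch of what that duplicated recalculation might look like, assuming the backup records per-item metadata and the restore applies its own filter (all names and the namespace-only filter are hypothetical stand-ins for the real include/exclude and label-selector logic):

```go
package main

import "fmt"

// backedUpItem is a hypothetical record of one item captured in the backup.
type backedUpItem struct {
	Resource  string
	Namespace string
}

// includedInRestore stands in for the restore's include/exclude and
// label-selector logic; here it simply filters by namespace.
func includedInRestore(item backedUpItem, namespaces map[string]bool) bool {
	return namespaces[item.Namespace]
}

func main() {
	backupItems := []backedUpItem{
		{"pods", "default"}, {"pods", "default"}, {"pods", "test"},
		{"secrets", "default"}, {"secrets", "test"}, {"secrets", "test"},
	}

	// The backup stored its own total, but a selective restore recomputes
	// the total it actually intends to restore.
	want := map[string]bool{"default": true}
	total := 0
	for _, item := range backupItems {
		if includedInRestore(item, want) {
			total++
		}
	}
	fmt.Printf("restoring %d of %d backed-up items\n", total, len(backupItems))
}
```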
This is pretty rough, although it's working pretty well. It uses the JSON output in restic (using master; JSON output was added after 0.9.4) to update the PodVolumeBackup CR with progress. It also dumps the restic output to the pod logs, although that's an awful lot of output, so maybe that's not great. I don't know if an approach like this, if cleaned up, would be interesting? Unfortunately, at this time it looks like restic does not yet have similar output for restores, so this approach is not yet possible for restore.
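For context, a minimal sketch of consuming restic's line-delimited JSON status output and turning it into a progress report; the field names (message_type, percent_done, total_bytes, bytes_done) are my reading of restic's `backup --json` stream on master and should be double-checked, and the actual PodVolumeBackup CR patch is left as a comment:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// resticStatus is a partial view of restic's `backup --json` status lines;
// the field names are assumptions and should be verified against restic.
type resticStatus struct {
	MessageType string  `json:"message_type"`
	PercentDone float64 `json:"percent_done"`
	TotalBytes  uint64  `json:"total_bytes"`
	BytesDone   uint64  `json:"bytes_done"`
}

func main() {
	// In the real integration this would read restic's stdout; stdin is used
	// here so the sketch is runnable on its own.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		var st resticStatus
		if err := json.Unmarshal(scanner.Bytes(), &st); err != nil || st.MessageType != "status" {
			continue // ignore non-JSON and non-status lines
		}
		// This is where the PodVolumeBackup CR status would be patched,
		// ideally on a ticker rather than on every line to limit API calls.
		fmt.Printf("%d of %d bytes done (%.0f%%)\n", st.BytesDone, st.TotalBytes, 100*st.PercentDone)
	}
}
```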
@jmontleon I really like the idea. Looking at it some more.
@skriss if you'd like to see it in action, I have an image at docker.io/jmontleon/velero with the changes. It should work if you just update the image on the restic DaemonSet and the Velero deployment in a test environment and perform a backup with restic. It updates at a 10-second interval, which could probably be made an optional parameter, so as long as the backup takes 30-60 seconds or so, you should get an idea of what it looks like.
@jmontleon sorry I've been slow in providing feedback here; it hasn't fallen off my radar.
It does not - no information is collected from the plugins to inform progress. We could consider that, but it would be a separate enhancement. |
Provide a way for users to see the progress of an in-flight backup. Some thoughts: