Backup progress #20
Idea: store per-backup log file to object storage. Add
Note: backup logs are separate from progress. We'll be coming up with ways to track real-time progress, as described above, to close out this issue.
For backups, we process first by resource then by namespace. I think for progress reporting, we should not be tightly coupled to the current mode/order of processing since that could change. We can count the number of resource-namespace combinations and then report as each pair gets completed. My initial thought is something like:
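The original sketch was not captured in this thread. Purely as an illustration of "count the pairs, report as each completes", here is a minimal Go sketch; the `pair` and `progress` types and all field names are hypothetical, not part of the proposed design:

```go
package main

import "fmt"

// pair identifies one resource/namespace combination to back up (hypothetical type).
type pair struct {
	Resource  string
	Namespace string
}

// progress tracks completed pairs out of the total, i.e. "x of y".
type progress struct {
	Completed int
	Total     int
}

func main() {
	pairs := []pair{
		{"pods", "default"},
		{"pods", "kube-system"},
		{"deployments.apps", "default"},
	}

	p := progress{Total: len(pairs)}
	for _, pr := range pairs {
		// ... back up all items for this resource/namespace pair ...
		p.Completed++
		fmt.Printf("backed up %s in %s (%d of %d)\n", pr.Resource, pr.Namespace, p.Completed, p.Total)
	}
}
```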
cc @ncdc
@ncdc let me know if you have any thoughts on this. We may have to wait until we decide on a revised backup/restore design before finalizing the implementation plan.
@skriss it would be nice if operationProgress.percentComplete were more accurate, based on the percentages of the individual ItemProgresses. When we do a backup, we know up front how many different types of resources we have, and how many items we have per resource type. If we store that information in backup.status, we can use it when restoring. Doing it that way, we could simply use % complete per resource type, and not worry about namespace.
We could store a map of GroupResource string to:

```go
type ResourceStatus struct {
	Processed int
	Total     int
}
```

It could look like this:

```yaml
resourceProgress:
  pods:
    processed: 15
    total: 100
  storageclasses.storage.k8s.io:
    processed: 0
    total: 3
```
And if we need to precalculate and store percentages, we could, although it's probably easy enough not to and just let consumers do it. Also, if we ever move to a work queue and we want to have multiple workers independently updating progress, we could get a lot of conflicts and retries trying to update a single map in a single Backup. Maybe JSON Patch would help there...
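As a rough illustration of leaving the math to consumers, a small Go sketch that derives an overall "x of y" (and, optionally, a percentage) from such a map; the ResourceStatus type mirrors the struct above, and everything else is hypothetical:

```go
package main

import "fmt"

// ResourceStatus mirrors the struct proposed above.
type ResourceStatus struct {
	Processed int
	Total     int
}

func main() {
	resourceProgress := map[string]ResourceStatus{
		"pods":                          {Processed: 15, Total: 100},
		"storageclasses.storage.k8s.io": {Processed: 0, Total: 3},
	}

	// Consumers can sum the stored counts and derive any percentage they want.
	var processed, total int
	for _, rs := range resourceProgress {
		processed += rs.Processed
		total += rs.Total
	}

	fmt.Printf("%d of %d items processed", processed, total)
	if total > 0 {
		fmt.Printf(" (%.0f%%)", 100*float64(processed)/float64(total))
	}
	fmt.Println()
}
```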
- I'd be fine just doing it per-resource rather than also by namespace.
- How do we know the total # of items per resource type? We list/back them up per namespace.
- I like this idea in theory, but need to think about how it interacts with restore includes/excludes and label selectors.
- WDYT about having a single goroutine responsible for updating progress, with the workers just reporting to that goroutine? (A sketch follows below.)
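A minimal sketch of that single-updater idea, assuming workers send increments over a channel and one goroutine owns all writes to the shared status; names here are illustrative, not Velero APIs:

```go
package main

import (
	"fmt"
	"sync"
)

// update is one worker's report: a resource name and how many items it finished.
type update struct {
	Resource string
	Done     int
}

func main() {
	updates := make(chan update)
	var workers sync.WaitGroup

	// Workers only send reports; they never touch the shared progress state.
	for _, r := range []string{"pods", "secrets", "configmaps"} {
		workers.Add(1)
		go func(resource string) {
			defer workers.Done()
			for i := 0; i < 3; i++ {
				// ... back up one item ...
				updates <- update{Resource: resource, Done: 1}
			}
		}(r)
	}

	go func() {
		workers.Wait()
		close(updates)
	}()

	// A single goroutine (here, main) applies every update, so there are no
	// concurrent writes and no conflict/retry storms from competing patches.
	progress := map[string]int{}
	for u := range updates {
		progress[u.Resource] += u.Done
		fmt.Printf("progress: %v\n", progress)
	}
}
```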
Regardless of the implementation approach chosen, I would strongly advise against using percentages to measure progress. I would recommend using "x of y" instead.
Yeah, we won't, as I wrote above.
cc @jbeda - another UX question
I think the status of a backup comes down to a set of questions:
The problem with percentages is that it is hard to answer these if the backup will take a long time. If it takes 5 minutes to move one percent, then you have to wait 5-10 minutes to get an idea. With that in mind, having the raw data helps. It doesn't have to be super accurate -- something like "tasks", or some counter that moves regularly.
My initial impression here is that we'd store the total in the backup, but we may have to recalculate it when doing a selective restore. So a backup could hold 10 items, but we only want 6 in this restore. That does mean we're duplicating the logic, but currently I think that's not terrible.
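As a sketch of what that duplicated recalculation might look like, assuming the backup records per-item metadata and the restore applies its own filter (all names and the namespace-only filter are hypothetical stand-ins for the real include/exclude and label-selector logic):

```go
package main

import "fmt"

// backedUpItem is a hypothetical record of one item captured in the backup.
type backedUpItem struct {
	Resource  string
	Namespace string
}

// includedInRestore stands in for the restore's include/exclude and
// label-selector logic; here it simply filters by namespace.
func includedInRestore(item backedUpItem, namespaces map[string]bool) bool {
	return namespaces[item.Namespace]
}

func main() {
	backupItems := []backedUpItem{
		{"pods", "default"}, {"pods", "default"}, {"pods", "test"},
		{"secrets", "default"}, {"secrets", "test"}, {"secrets", "test"},
	}

	// The backup stored its own total, but a selective restore recomputes
	// the total it actually intends to restore.
	want := map[string]bool{"default": true}
	total := 0
	for _, item := range backupItems {
		if includedInRestore(item, want) {
			total++
		}
	}
	fmt.Printf("restoring %d of %d backed-up items\n", total, len(backupItems))
}
```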
This is pretty rough, although it's working pretty well. It uses the JSON output in restic (using master; JSON output was added after 0.9.4) to update the PodVolumeBackup CR with progress. It also dumps the restic output to the pod logs, although that's an awful lot of output, so maybe that's not great. I don't know if an approach like this, if cleaned up, would be interesting? Unfortunately, at this time it looks like restic does not yet have similar output for restores, so this approach is not yet possible for restore.
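For context, a minimal sketch of consuming restic's line-delimited JSON status output and turning it into a progress report; the field names (message_type, percent_done, total_bytes, bytes_done) are my reading of restic's `backup --json` stream on master and should be double-checked, and the actual PodVolumeBackup CR patch is left as a comment:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// resticStatus is a partial view of restic's `backup --json` status lines;
// the field names are assumptions and should be verified against restic.
type resticStatus struct {
	MessageType string  `json:"message_type"`
	PercentDone float64 `json:"percent_done"`
	TotalBytes  uint64  `json:"total_bytes"`
	BytesDone   uint64  `json:"bytes_done"`
}

func main() {
	// In the real integration this would read restic's stdout; stdin is used
	// here so the sketch is runnable on its own.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		var st resticStatus
		if err := json.Unmarshal(scanner.Bytes(), &st); err != nil || st.MessageType != "status" {
			continue // ignore non-JSON and non-status lines
		}
		// This is where the PodVolumeBackup CR status would be patched,
		// ideally on a ticker rather than on every line to limit API calls.
		fmt.Printf("%d of %d bytes done (%.0f%%)\n", st.BytesDone, st.TotalBytes, 100*st.PercentDone)
	}
}
```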
@jmontleon I really like the idea. Looking at it some more.
@skriss if you'd like to see it in action, I have an image at docker.io/jmontleon/velero with the changes. It should work if you just update the image on the restic DaemonSet and the Velero deployment in a test environment and perform a backup with restic. It updates at a 10-second interval, which could probably be made an optional parameter, so as long as the backup takes 30-60 seconds or so, you should get an idea of what it looks like.
@jmontleon sorry I've been slow in providing feedback here; it hasn't fallen off my radar.
It does not - no information is collected from the plugins to inform progress. We could consider that, but it would be a separate enhancement. |
Provide a way for users to see the progress of an in-flight backup. Some thoughts: