When a backup or restore fails, provide the user information or instructions to find out root cause #305

Closed
rdodev opened this issue Feb 7, 2018 · 12 comments
Labels
Enhancement/User (End-User Enhancement to Velero), Needs Product (Blocked needing input or feedback from Product), Reviewed Q2 2021, staled

Comments

@rdodev

rdodev commented Feb 7, 2018

Presently a backup can fail for a number of reasons. If the user runs ark backup describe, they will see that the backup failed, but no reason, logs, or anything else that would help them understand the problem and how to fix it:


[centos@ip ark]$ ./ark backup describe nginx-001
Name:         nginx-001
Namespace:    heptio-ark
Labels:       <none>
Annotations:  <none>

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Phase:  Failed

Backup Format Version:  1

Expiration:  2018-03-09 19:50:47 +0000 UTC

Validation errors:  <none>

Persistent Volumes: <none included>

Having a Logs section there, or else instructions for finding out what happened, would be greatly helpful.

@ncdc ncdc changed the title When a backup fails, provide the user information or instructions to find out root cause When a backup or restore fails, provide the user information or instructions to find out root cause Feb 7, 2018
@ncdc
Contributor

ncdc commented Feb 14, 2018

Example of an error that is only visible in the ark server pod log:

time="2018-02-14T00:50:28Z" level=error msg="backup failed" error="rpc error: code = Unknown desc = error putting object mybackup/ark-backup.json: AccessDenied: Access Denied\n\tstatus code: 403, request id: 096E12AAC407F5B7, host id: ..../....=" key=heptio-ark/mybackup logSource="pkg/controller/backup_controller.go:258"
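For anyone who does have access to the server log, lines like the one above appear to be in logrus's key=value text format, so they can be filtered mechanically. Below is a minimal Go sketch under that assumption. `parseLogfmt` and `sampleLine` are names made up for this example and are not part of Ark; the sample line is the one quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleLine is the server log line quoted in this issue, kept verbatim.
const sampleLine = `time="2018-02-14T00:50:28Z" level=error msg="backup failed" error="rpc error: code = Unknown desc = error putting object mybackup/ark-backup.json: AccessDenied: Access Denied\n\tstatus code: 403, request id: 096E12AAC407F5B7, host id: ..../....=" key=heptio-ark/mybackup logSource="pkg/controller/backup_controller.go:258"`

// parseLogfmt splits one key=value log line into a map. It handles bare
// values (level=error) and double-quoted values with backslash escapes
// (msg="backup failed"). It is a sketch, not a full logfmt parser.
func parseLogfmt(line string) map[string]string {
	fields := make(map[string]string)
	i := 0
	for i < len(line) {
		// Skip the spaces that separate pairs.
		if line[i] == ' ' {
			i++
			continue
		}
		eq := strings.IndexByte(line[i:], '=')
		if eq < 0 {
			break
		}
		key := line[i : i+eq]
		i += eq + 1
		var val string
		if i < len(line) && line[i] == '"' {
			// Find the closing quote, stepping over escaped characters.
			j := i + 1
			for j < len(line) && line[j] != '"' {
				if line[j] == '\\' {
					j++ // skip the escaped character
				}
				j++
			}
			end := j + 1
			if end > len(line) {
				end = len(line)
			}
			raw := line[i:end]
			// Quoted values use Go string syntax, so Unquote decodes \n, \t, \".
			if unq, err := strconv.Unquote(raw); err == nil {
				val = unq
			} else {
				val = raw // unterminated or malformed quote: keep as-is
			}
			i = end
		} else {
			n := strings.IndexByte(line[i:], ' ')
			if n < 0 {
				n = len(line) - i
			}
			val = line[i : i+n]
			i += n
		}
		fields[key] = val
	}
	return fields
}

func main() {
	f := parseLogfmt(sampleLine)
	// Surface only error-level entries, tagged with the backup they belong to.
	if f["level"] == "error" {
		fmt.Printf("%s: %s\n  %s\n", f["key"], f["msg"], f["error"])
	}
}
```

This only helps users who can read the server pod log in the first place, which is exactly the gap this issue is about.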

@donbecker

Have just set up ark and am getting this, using AWS and the nginx example (non PV). Have I overlooked troubleshooting steps?

@donbecker

donbecker commented Apr 23, 2018

ark backup logs <backup name>

@ncdc
Contributor

ncdc commented Apr 23, 2018

@donbecker if you wouldn't mind, please join us in #ark-dr on the Kubernetes slack for real-time troubleshooting, or create a new issue. This issue is an RFE to provide more details to the user why a backup failed. Thanks!

@rosskukulinski
Contributor

This is super important from a usability perspective. While the ark server log can provide debugging information, it might be hard to hunt down, especially because users creating backups may not have access to the Ark server logs.

One possibility would be to leverage the status field in the Backup CRD to reflect error/failure details (may require k8s 1.11 - #529), or alternatively to leverage the Kubernetes Events API to track backup or restore failure events. This is also likely related to backup progress: #20
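To make the status-field idea concrete, here is a rough Go sketch. It assumes a trimmed-down status struct: `Phase` matches the line in the describe output above, while `FailureReason`, `backupStatus`, and `describePhase` are hypothetical names invented for illustration and do not reflect the actual Ark API types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// backupStatus is a hypothetical, trimmed-down status block.
// FailureReason is an invented field, not an existing Ark field.
type backupStatus struct {
	Phase         string `json:"phase"`
	FailureReason string `json:"failureReason,omitempty"`
}

// describePhase renders the Phase line of the describe output,
// appending the failure reason when the controller recorded one.
func describePhase(s backupStatus) string {
	if s.FailureReason == "" {
		return "Phase:  " + s.Phase
	}
	return "Phase:  " + s.Phase + " (" + s.FailureReason + ")"
}

func main() {
	// What a status block could look like if the controller copied the
	// error from the server log into the resource.
	raw := []byte(`{"phase":"Failed","failureReason":"error putting object mybackup/ark-backup.json: AccessDenied: Access Denied"}`)

	var s backupStatus
	if err := json.Unmarshal(raw, &s); err != nil {
		panic(err)
	}
	fmt.Println(describePhase(s))
}
```

The appeal of this approach is that anyone who can read the Backup resource can see the reason, without needing access to the server pod.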

@ncdc
Contributor

ncdc commented Jun 19, 2018

@rosskukulinski the status field is currently supported; with k8s 1.11 we gain the ability to use the /status subresource in the http request.

@rosskukulinski rosskukulinski added the Needs Product Blocked needing input or feedback from Product label Jul 18, 2018
@rosskukulinski
Contributor

Product question: What are the common errors/error states that we want to be able to resolve?

Sources that can help piece together what happened:

  • per-backup / per-restore logs
  • restores have warnings/errors file
  • ark server log
  • restic pod logs

Restores (Related: #286)

  • per-restore log
  • errors file
  • warnings file

@ncdc ncdc modified the milestones: v1.0.0, v1.x Nov 9, 2018
@skriss skriss added the Candidate for close Issues that should be closed and need a team review before closing label Sep 10, 2019
@skriss skriss removed the Candidate for close Issues that should be closed and need a team review before closing label Sep 18, 2019
@carlisia
Contributor

carlisia commented Sep 18, 2019

When we are helping users debug Velero, we often ask for the output from describe as well as the logs for the backup/restore. The logs are usually not completely helpful unless the user can reproduce the failure after setting the log level to debug, which increases the amount of logging to sort thru. And for backups stuck in "InProgress" or for more complicated cases, we have to dig thru the output of the entire Velero log.

One alternative to make debugging easier and faster, and to potentially address this request:

We know at every step of the way what activity the backup/restore is performing. We could keep a running list of these "events" and add them to the describe output, the way Kubernetes does. Knowing where in the process the failure occurred could itself be a hint for how to fix the issue, but otherwise it is a great starting point for where in the logs to start looking.

@ncdc
Contributor

ncdc commented Sep 18, 2019

FYI, events have a TTL and are automatically deleted after they expire. It's usually a pretty short amount of time: 1 to 2 hours by default, iirc. Also, each event is its own resource in etcd, so you would probably want to avoid having thousands of events for each backup/restore. Finally, the default client-side event broadcasting code that's in client-go has a "spam filter" to make sure that a single component + object target isn't overloading the system. The defaults are fairly low; I think it's something like 10 or 25 events in a minute. If you exceed the threshold, the events that you generated are silently dropped, and it's really confusing trying to figure out why they're disappearing into thin air.
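To illustrate why events can vanish without any error, here is a small Go sketch of a token-bucket style filter. This is not client-go's code: the `bucket` type, the burst of 25, and the five-minute refill interval are illustrative assumptions loosely based on the numbers recalled above, not verified defaults.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only; not verified client-go defaults.
const refill = 5 * time.Minute

// start is a fixed reference time so the example is deterministic.
var start = time.Date(2019, time.September, 18, 0, 0, 0, 0, time.UTC)

// bucket is a simple token bucket: it holds at most burst tokens and
// gains one token per refillEvery interval.
type bucket struct {
	burst       int
	refillEvery time.Duration
	tokens      int
	last        time.Time
}

func newBucket(burst int, refillEvery time.Duration, now time.Time) *bucket {
	return &bucket{burst: burst, refillEvery: refillEvery, tokens: burst, last: now}
}

// allow reports whether an event emitted at now would be recorded.
// A false result is all the caller gets: no error, no log line.
func (b *bucket) allow(now time.Time) bool {
	// Add one token for every full refill interval that has elapsed.
	if elapsed := now.Sub(b.last); elapsed >= b.refillEvery {
		n := int(elapsed / b.refillEvery)
		b.tokens += n
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = b.last.Add(time.Duration(n) * b.refillEvery)
	}
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

// emit sends n events one second apart, beginning at from, and counts
// how many were kept and how many were dropped.
func emit(b *bucket, from time.Time, n int) (kept, dropped int) {
	for i := 0; i < n; i++ {
		if b.allow(from.Add(time.Duration(i) * time.Second)) {
			kept++
		} else {
			dropped++
		}
	}
	return kept, dropped
}

func main() {
	b := newBucket(25, refill, start)
	kept, dropped := emit(b, start, 100)
	fmt.Printf("kept=%d dropped=%d\n", kept, dropped)
}
```

With a filter like this, a backup that emits one event per item would lose most of them once the burst is used up, which fits the "disappearing into thin air" behavior described above.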

@carlisia
Contributor

Thanks for the explanation. Yes, I have noticed that events go away, but didn't know why; this is helpful.

Maybe events are not what we need; that sounds like overkill. Trying again, w/o using a vocabulary that has any k8s meaning: We have a limited number of "steps" that happen from running "create" until the backup reaches its final phase of "Complete". I submit that it would be helpful to list these steps in the describe output. And I think there are not so many of them that it would be unreasonable. So, instead of "Events", we would have our own "Steps" section.
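As a rough illustration of what a "Steps" section could look like, here is a minimal Go sketch. Everything in it is hypothetical: the `stepLog` type, the step names, and the output layout are invented for this example and are not taken from the Velero code base. The failing step reuses the AccessDenied error quoted earlier in this issue.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// step is one unit of backup work and its outcome.
type step struct {
	name string
	err  error
}

// stepLog keeps the running list of steps for a single backup.
type stepLog struct {
	steps []step
}

// record appends a finished step; err is nil when the step succeeded.
func (l *stepLog) record(name string, err error) {
	l.steps = append(l.steps, step{name: name, err: err})
}

// describe renders the list as a section for the describe output.
func (l *stepLog) describe() string {
	var sb strings.Builder
	sb.WriteString("Steps:\n")
	for _, s := range l.steps {
		status := "ok"
		if s.err != nil {
			status = "FAILED: " + s.err.Error()
		}
		sb.WriteString("  " + s.name + ": " + status + "\n")
	}
	return sb.String()
}

// sample builds a log for a backup that fails at upload time.
// The step names are made up for illustration.
func sample() *stepLog {
	l := &stepLog{}
	l.record("validate backup spec", nil)
	l.record("collect resources", nil)
	l.record("upload backup to object storage", errors.New("AccessDenied: Access Denied"))
	return l
}

func main() {
	fmt.Print(sample().describe())
}
```

Because the list lives with the backup rather than in the Events API, it would not be subject to the event TTL or the client-side spam filter mentioned above.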

@nrb nrb removed this from the v1.x milestone Dec 8, 2020
@stale

stale bot commented Jul 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Jul 8, 2021
@stale

stale bot commented Jul 22, 2021

Closing the stale issue.

@stale stale bot closed this as completed Jul 22, 2021

8 participants