When a backup or restore fails, provide the user information or instructions to find out root cause #305

rdodev · 2018-02-07T20:09:14Z

Presently a backup can fail for a number of reasons. If the user runs ark backup describe, they will see that the backup failed, but no reason, logs, or anything else that would help them understand the problem and how to fix it:


[centos@ip ark]$ ./ark backup describe nginx-001
Name:         nginx-001
Namespace:    heptio-ark
Labels:       <none>
Annotations:  <none>

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Phase:  Failed

Backup Format Version:  1

Expiration:  2018-03-09 19:50:47 +0000 UTC

Validation errors:  <none>

Persistent Volumes: <none included>

Having a Logs section there, or else instructions for finding out what happened, would be greatly helpful.


ncdc · 2018-02-14T21:16:00Z

Example error only seen in ark server pod log:

time="2018-02-14T00:50:28Z" level=error msg="backup failed" error="rpc error: code = Unknown desc = error putting object mybackup/ark-backup.json: AccessDenied: Access Denied\n\tstatus code: 403, request id: 096E12AAC407F5B7, host id: ..../....=" key=heptio-ark/mybackup logSource="pkg/controller/backup_controller.go:258"
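As a rough illustration of how a user with access to the server log could hunt down a line like the one above, here is a minimal sketch that parses logrus-style `key=value` text and keeps only the error entries for one backup. The helper names `parse_logfmt` and `errors_for_backup` are invented for this example and are not part of ark.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# (backslash escapes allowed) or a bare token containing no whitespace.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Parse one logrus-style text line into a dict of fields."""
    fields = {}
    for key, raw in PAIR.findall(line):
        if raw.startswith('"'):
            # Drop the surrounding quotes and unescape \" and \\ only;
            # sequences such as \n are left exactly as logged.
            raw = re.sub(r'\\(["\\])', r'\1', raw[1:-1])
        fields[key] = raw
    return fields


def errors_for_backup(lines, key):
    """Return the error text of every error-level line for one backup key.

    `key` is the namespace/name pair found in the log's key= field,
    for example "heptio-ark/mybackup".
    """
    found = []
    for fields in map(parse_logfmt, lines):
        if fields.get("level") == "error" and fields.get("key") == key:
            found.append(fields.get("error", fields.get("msg", "")))
    return found
```

Feeding it the server pod log line by line and passing `heptio-ark/mybackup` as the key would return the `rpc error: ...` text from the entry above. This only helps users who can read the server log at all, which is the limitation this issue is about.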

donbecker · 2018-04-23T20:51:59Z

Have just set up ark and am getting this, using AWS and the nginx example (non PV). Have I overlooked troubleshooting steps?

donbecker · 2018-04-23T20:53:06Z

ark backup logs <backup name>

ncdc · 2018-04-23T20:53:35Z

@donbecker if you wouldn't mind, please join us in #ark-dr on the Kubernetes slack for real-time troubleshooting, or create a new issue. This issue is an RFE to provide more details to the user why a backup failed. Thanks!

rosskukulinski · 2018-06-17T20:23:47Z

This is super important from a usability perspective. While the ark server log can provide debugging information, it might be hard to hunt down, especially because users creating backups may not have access to the Ark server logs.

One possibility would be to leverage the status field in the Backup CRD to reflect error/failure details (may require k8s 1.11 - #529), or alternatively to leverage the Kubernetes Events API to track backup or restore failure events. This is also likely related to backup progress: #20
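To make the status-field idea concrete, here is a minimal sketch that models a Backup object as a plain dict. The `failureReason` field name and both helper functions are assumptions made for illustration; nothing in this thread confirms what the real field would be called.

```python
def record_failure(backup, err):
    """Store a short failure summary on the backup's status.

    "failureReason" is a hypothetical field name used only in this sketch.
    """
    status = backup.setdefault("status", {})
    status["phase"] = "Failed"
    status["failureReason"] = str(err)
    return backup


def describe_phase(backup):
    """Render the Phase line of a describe output, plus the reason if set."""
    status = backup.get("status", {})
    lines = ["Phase:  " + status.get("phase", "")]
    reason = status.get("failureReason")
    if reason:
        lines.append("Failure reason:  " + reason)
    return "\n".join(lines)
```

With something along these lines, a user who can read the Backup object but not the server logs would still see why the phase is Failed.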

ncdc · 2018-06-19T15:44:16Z

@rosskukulinski the status field is currently supported; with k8s 1.11 we gain the ability to use the /status subresource in the http request.

rosskukulinski · 2018-07-18T16:48:42Z

Product question: What are the common errors/error states that we want to be able to resolve?

Sources that can help piece together what happened:

per-backup / per-restore logs
restores have warnings/errors file
ark server log
restic pod logs

Restores (Related: #286)

per-restore log
errors file
warnings file

carlisia · 2019-09-18T20:43:39Z

When we are helping users debug Velero, we often ask for the output from describe as well as the logs for the backup/restore. The logs are usually not completely helpful unless the user can reproduce the failure after setting the log level to debug, which increases the amount of logging to sort thru. And for backups stuck in "InProgress" or for more complicated cases, we have to dig thru the output of the entire Velero log.

One alternative to make debugging easier and faster, and to potentially address this request:

We know at every step of the way what activity the backup/restore is performing. We could keep a running list of these "events" and add them to the describe output, the way Kubernetes does. Knowing where in the process the failure occurred could itself be a hint for how to fix the issue, but otherwise it is a great starting point for where in the logs to start looking.

ncdc · 2019-09-18T21:25:54Z

FYI, events have a TTL and are automatically deleted after they expire. It's usually a pretty short amount of time - 1 to 2 hours by default, iirc. Also, each event is its own resource in etcd, so you would probably want to avoid having thousands of events for each backup/restore. Finally, the default client-side event broadcasting code that's in client-go has a "spam filter" to make sure that a single component + object target isn't overloading the system. The defaults are fairly low - I think it's something like 10 or 25 events in a minute. If you exceed the threshold, the events that you generated are silently dropped, and it's really confusing trying to figure out why they're disappearing into thin air.
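The silent-drop behaviour can be pictured with a generic token bucket. This is only a sketch of the general technique, not client-go's actual code; the burst and refill values are placeholders, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Generic token bucket: `burst` tokens up front, refilled over time."""

    def __init__(self, burst, refill_per_second):
        self.capacity = burst
        self.tokens = float(burst)
        self.rate = refill_per_second
        self.last = 0.0

    def allow(self, now):
        # Add tokens for the time elapsed since the previous call,
        # never exceeding the bucket's capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: the caller drops the event


def emit_events(events, bucket, now=0.0):
    """Return the events that pass the filter; the rest vanish without error."""
    return [event for event in events if bucket.allow(now)]
```

If a backup emitted 100 events in one burst against a bucket of 25 with no refill, only the first 25 would come back from `emit_events`, and nothing would report that the other 75 were discarded.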

carlisia · 2019-09-18T21:32:36Z

Thanks for the explanation. Yes, I have noticed that events go away but didn't know why; this is helpful.

Maybe events are not what we need; that sounds like overkill. Trying again, w/o using a vocabulary that has any k8s meaning: We have a limited number of "steps" that happen from running "create" until the backup reaches its final phase of "Complete". I submit that it would be helpful to list these steps in the describe output. And I think there are not so many that listing them would be unreasonable. So, instead of "Events", we would have our own "Steps" section.
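A minimal sketch of what such a "Steps" section could look like, assuming the steps are kept as an ordered list alongside the backup. The `StepLog` class, the step names, and the output layout are all hypothetical and are not taken from Velero.

```python
class StepLog:
    """Ordered record of the steps a backup or restore has gone through."""

    def __init__(self):
        self.steps = []  # each entry is [name, state, detail]

    def start(self, name):
        self.steps.append([name, "InProgress", ""])

    def finish(self, error=None):
        # Close out the most recently started step.
        step = self.steps[-1]
        step[1] = "Failed" if error else "Done"
        step[2] = str(error) if error else ""

    def render(self):
        """Format the steps the way a describe section might print them."""
        lines = ["Steps:"]
        for name, state, detail in self.steps:
            suffix = f" ({detail})" if detail else ""
            lines.append(f"  {name}: {state}{suffix}")
        return "\n".join(lines)
```

A step still marked InProgress would show where a stuck backup stopped, and a Failed step would point to the part of the log worth reading first.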

stale · 2021-07-08T15:37:04Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2021-07-22T18:13:09Z

Closing the stale issue.

ncdc changed the title ~~When a backup fails, provide the user information or instructions to find out root cause~~ When a backup or restore fails, provide the user information or instructions to find out root cause Feb 7, 2018

ncdc added enhancement labels May 8, 2018

rosskukulinski added the Enhancement/User End-User Enhancement to Velero label Jun 17, 2018

rosskukulinski added this to the v1.0.0 milestone Jun 17, 2018

rosskukulinski mentioned this issue Jun 18, 2018

Consider consolidating per-restore logs into a single file? (Similarly for backups) #286

Closed

rosskukulinski removed the Enhancement label Jun 25, 2018

ncdc mentioned this issue Jun 26, 2018

Backups fail if PVCs are in Lost status #225

Closed

rosskukulinski added the Needs Product Blocked needing input or feedback from Product label Jul 18, 2018

ncdc modified the milestones: v1.0.0, v1.x Nov 9, 2018

skriss added the Candidate for close Issues that should be closed and need a team review before closing label Sep 10, 2019

skriss removed the Candidate for close Issues that should be closed and need a team review before closing label Sep 18, 2019

skriss removed the P1 - Important label Feb 19, 2020

nrb removed this from the v1.x milestone Dec 8, 2020

dsu-igeek added the Reviewed Q2 2021 label May 3, 2021

stale bot added the staled label Jul 8, 2021

stale bot closed this as completed Jul 22, 2021
