When a backup or restore fails, provide the user with information or instructions to find the root cause #305
Comments
Example error, only seen in the ark server pod log:

I have just set up Ark and am getting this, using AWS and the nginx example (non-PV). Have I overlooked any troubleshooting steps?
@donbecker if you wouldn't mind, please join us in #ark-dr on the Kubernetes Slack for real-time troubleshooting, or create a new issue. This issue is an RFE to provide more details to the user about why a backup failed. Thanks!
This is super important from a usability perspective. While the ark server log can provide debugging information, it might be hard to hunt down, especially because users creating backups may not have access to the Ark server logs. One possibility would be to leverage the …
@rosskukulinski the …
Product question: what are the common errors/error states that we want to be able to resolve? Sources that can help piece together what happened:
Restores (Related: #286)
When we are helping users debug Velero, we often ask for the output from …. One alternative to make debugging easier and faster, and to potentially address this request: we know at every step of the way what activity the backup/restore is performing. We could keep a running list of these "events" and add them to the describe output, the way Kubernetes does. Knowing where in the process the failure occurred could itself be a hint for how to fix the issue; otherwise, it is a great starting point for where in the logs to start looking.
FYI, events have a TTL and are automatically deleted after they expire. It's usually a pretty short amount of time - 1 to 2 hours by default, IIRC. Also, each event is its own resource in etcd, so you would probably want to avoid having thousands of events for each backup/restore. Finally, the default client-side event broadcasting code in client-go has a "spam filter" to make sure that a single component + object target isn't overloading the system. The defaults are fairly low - I think it's something like 10 or 25 events in a minute. If you exceed the threshold, the events you generated are silently dropped, and it's really confusing trying to figure out why they're disappearing into thin air.
Thanks for the explanation. Yes, I have noticed that events go away, but didn't know why; this is helpful. Maybe events are not what we need - that sounds like overkill. Trying again, without using vocabulary that has any k8s meaning: we have a limited number of "steps" that happen from running "create" until the backup reaches its final phase of "Complete". I submit that it would be helpful to list these steps in the describe output, and I think there are not so many that it would be unreasonable. So, instead of "Events", we would have our own "Steps" section.
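As a rough illustration, a "Steps" section in the describe output might look something like this (the step names and layout here are invented for the sake of the example, not actual Ark output):

```
Name:    nginx-backup
Phase:   Failed

Steps:
  2018-05-01T12:00:00Z  validating backup spec
  2018-05-01T12:00:01Z  listing resources in namespace nginx-example
  2018-05-01T12:00:03Z  uploading backup to object storage
  2018-05-01T12:00:04Z  upload failed: see server log for details
```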
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the stale issue.
Presently, a backup can fail for a number of reasons. If the user runs `ark backup describe`, they will see that the backup failed, but no reason, logs, or anything else that would help them understand the problem and how to fix it. Having a `Logs` section there, or else instructions for finding out what happened, would be greatly helpful.