Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

user-callback is not called properly when an error occurs #1491

Closed
jean-christophe-manciot opened this issue Aug 7, 2023 · 26 comments
Closed

Comments

@jean-christophe-manciot
Copy link

backintime-common/backintime-qt: 1.3.3-4 https://fr.archive.ubuntu.com/ubuntu lunar/universe amd64 Packages

When the "Invalid cross-device link" issue happens, the snapshot fails but the user-callback is not called properly. It should be called as:

user-callback <profile_id> <profile_name> 4

I discovered this issue because my user-callback is supposed to send an email when an error (4) happens, but I received nothing.
However, If I run it manually with the right parameters, for instance ./user-callback 1 Main 4, the email is sent & received.

Attached files:

  1. bit diagnotics: bit-diags.txt
  2. user-callback
@aryoda
Copy link
Contributor

aryoda commented Aug 8, 2023

I think this bug may occur whenever rsync exits with an error.

I will look into our code and try to fix this when I find the time for this (holidays at the moment).

The underlying error reason (documented in #851) I cannot fix unfortunately, this needs the eyes of an rsync specialist (cmd arg tweaking required I guess)...

@aryoda
Copy link
Contributor

aryoda commented Aug 8, 2023

Possibly related to #1053

@aryoda
Copy link
Contributor

aryoda commented Aug 9, 2023

First bug analysis results

I could reproduce the underlying error ("invalid cross-device link", see #851 (comment)) and have traced which user callback functions are called in case of a successful vs. erroneous backup (takeSnapshot):

------- user callback sequence for backup w/o errors ---------------------------
2023-08-09 07:32:35 (<user>/168194) profile_id=1: 7 - mount drive requested
2023-08-09 07:32:36 (<user>/168199) profile_id=1: 1 - backup process begins
2023-08-09 07:32:38 (<user>/168310) profile_id=1: 3 - new snapshot taken (snapshot_id=20230809-093235-919 - snapshot_name=/home/<user>/temp/backintime/AMD5600x/<user>/1/20230809-093235-919)
2023-08-09 07:32:38 (<user>/168313) profile_id=1: 2 - backup process ends
2023-08-09 07:32:38 (<user>/168317) profile_id=1: 8 - unmount drive requested

---- user callback sequence with an rsync error 23 in takeSnapshot -------------
2023-08-09 07:34:44 (<user>/168384) profile_id=6: 7 - mount drive requested
2023-08-09 07:34:44 (<user>/168389) profile_id=6: 1 - backup process begins
2023-08-09 07:34:46 (<user>/168419) profile_id=6: 2 - backup process ends
2023-08-09 07:34:46 (<user>/168423) profile_id=6: 8 - unmount drive requested

The user callback 4 is clearly not called in case of an error (as reported in the issue).

Suspected code location

The takeSnapshot function returns False for both ret_val and ret_error` which indicates "no (complete) snapshot taken + no error occured".

This iconstellation shortcuts the user callback 4 call.

Next steps

  • Find out why takeSnapshot does not report ret_error=True even though rsync exits with the exit code 23
  • Fix it

@buhtz
Copy link
Member

buhtz commented Aug 9, 2023

Maybe is #489 related (but not relevant for a solution). It is about killing rsync with BIT not noticing it.

@aryoda
Copy link
Contributor

aryoda commented Aug 9, 2023

@buhtz Sharp eyes, I will try to test the "kill rsync" behaviour too since a solution affects the same code locations that I am just debugging...

@buhtz
Copy link
Member

buhtz commented Aug 9, 2023

@buhtz Sharp eyes

It was more a foggy brain. 😃 I remembered there was "something about rsync return codes" in the past. While searching for it I found that Issue.

But what I remembered was about an unchecked return code when doing smartRemove(). But that was fixed via PR #1458.

@jean-christophe-manciot
Copy link
Author

@aryoda @buhtz

Nice job! We're getting closer to a solution.

  • How can you trace which user callback functions are called?
  • I remember that it happened when I launched BIT through the GUI
  • This issue might also happen when an error different from "Invalid cross-device link" occurs

@aryoda
Copy link
Contributor

aryoda commented Aug 9, 2023

* How can you trace which user callback functions are called?

I am using my diagnostics callback script for that: https://github.com/bit-team/user-callback/blob/master/user-callback.diagnostics

Documentation on how to use this could be better (how to install without overwriting the existing script, what is does, when to use, where to find the logged output...) but given our lack of resources this has low prio

* I remember that it happened when I launched BIT through the GUI

It will also happen when running BiT CLI or even the rsync command alone (which in contained in each snapshot folder in the takesnapshot.log.bz2 and a good way to check if the problem occurs with rsnyc alone...

* This issue might also happen when an error different from "Invalid cross-device link" occurs

Yes, I think so too and will try to fix this

@jean-christophe-manciot
Copy link
Author

I am using my diagnostics callback script for that: https://github.com/bit-team/user-callback/blob/master/user-callback.diagnostics

Very nice! I'll add it to my user-callback so that I'll have some better trace next time an issue happens.

@aryoda
Copy link
Contributor

aryoda commented Aug 16, 2023

@jean-christophe-manciot

I have just sent a PR (#1502) with all a lot of fixes. It would be great if find the time and a way to test this new version (clone + build without a make install so you can keep the installed and the patched version in parallel).

I have also just updated my diagnostics user callback script to support the two new error codes (and fix a bug with a wrong logging text):

https://github.com/bit-team/user-callback/blob/master/user-callback.diagnostics

aryoda added a commit that referenced this issue Aug 17, 2023
…lling rsync not handled gracefully / rsync exit code not evaluated) (#1502)

Fix bug: Taking a snapshot reports `rsync` errors now even if no snapshot was taken (#1491)
Fix bug: takeSnapshot() recognizes errors now by also evaluating the rsync exit code (#489)
         Fixes related problem: Killing `rsync` was not handled gracefully (by ignoring the rsync exit code)
Fix bug: The error user-callback is now always called if an error happened during taking a snapshot (#1491)
Feature: Introduce new error codes for the "error" user callback (as part of #1491):
             5: Error while taking a snapshot.
             6: New snapshot taken but with errors.
Improvement: The `rsync` exit code is now contained in the snapshot log. Example:
             [E] Error: 'rsync' ended with exit code -9 (negative values are signal numbers, see 'kill -l')
Fix CHANGES entries (stick to our standards)

---------

Co-authored-by: aryoda <[email protected]>
@jean-christophe-manciot
Copy link
Author

jean-christophe-manciot commented Aug 17, 2023

Before testing it tonight, I have built f9434b7 for Ubuntu lunar and made it available here.

@jean-christophe-manciot
Copy link
Author

And also 976081b

@buhtz
Copy link
Member

buhtz commented Aug 17, 2023

I want to learn more about your distro maintenance.
You do point to two commits. Did you just patched or build from our latest version?

Based on the version number backintime-common_1.3.3-5~git20230816.f9434b7+23.04_all.deb it seems to be a patch, right?

What happens when Debian decide to have a 1.3.3-5 with different patches in their repo?

The links you provide do look kind of "official". Is this Ubuntu "main" or "universe" (the community)?
I can't see your packages listed somewhere here:
https://packages.ubuntu.com/search?suite=all&searchon=names&keywords=backintime

You do have *kde*.deb and *qt4*.deb files behind your link. How comes that? They shouldn't exist anymore.

What is "lunar"? From a PR perspective it is nice to have fancy names. Even the old conservative Debian do use this. But it would improve communication if you could use a version number always. That makes it easier for people not using Ubuntu on a regular base to understand the context. Even as a regular Debian user I'm not able to remember the "name" of my current version. 🤣

@jean-christophe-manciot
Copy link
Author

I want to learn more about your distro maintenance. You do point to two commits. Did you just patched or build from our latest version?

Built.

What happens when Debian decide to have a 1.3.3-5 with different patches in their repo?

It will supersede all 1.3.3-5~git* thanks to the ~.

The links you provide do look kind of "official". Is this Ubuntu "main" or "universe" (the community)?

It is official in the sense that all packages are built and tested from only official sources, without any modification.
But in the Ubuntu world, it is a PPA (Personal Package Archives), so you need to install it on an Ubuntu lunar machine to be able to update your packages from it.

What is "lunar"? From a PR perspective it is nice to have fancy names. Even the old conservative Debian do use this. But it would improve communication if you could use a version number always. That makes it easier for people not using Ubuntu on a regular base to understand the context. Even as a regular Debian user I'm not able to remember the "name" of my current version. 🤣

Fair enough. Lunar is the latest Ubuntu stable distribution 23.04.

@jean-christophe-manciot
Copy link
Author

I also maintain a Debian repository, but I haven't built and tested the latest unstable debs for that environment yet.

@jean-christophe-manciot
Copy link
Author

jean-christophe-manciot commented Aug 17, 2023

You do have *kde*.deb and *qt4*.deb files behind your link. How comes that? They shouldn't exist anymore.

If you want to remove them, you'll have to update the debian/control file.

@aryoda
Copy link
Contributor

aryoda commented Aug 17, 2023

@jean-christophe-manciot Wow, didn't know about your work "behind the curtain" to maintain installation packages!

I think you also should include the PR #1503 to avoid a an traceback when trying to show the systray icon... (since we are on the dev branch)

@buhtz We have collected a lot of changes now in DEV, how about preparing a new interim release (e.g. for the ARCH maintainers who give a quick feedback in case of problems)?

@buhtz
Copy link
Member

buhtz commented Aug 17, 2023

Dear jean-christophe,
thanks for explaining. I learned something.

@jean-christophe-manciot
Copy link
Author

@aryoda

Wow, didn't know about your work "behind the curtain" to maintain installation packages!

It is my way of giving back to the open source community for its awesome work ;-).

I think you also should include the PR #1503 to avoid a an traceback when trying to show the systray icon... (since we are on the dev branch)

976081b is the latest dev commit that I see for now, but I'll try to build new unstable debs for any new related commits.

@aryoda
Copy link
Contributor

aryoda commented Aug 17, 2023

976081b is the latest dev commit that I see for now, but I'll try to build new unstable debs for any new related commits.

Yes, my other PR (#1503) is still in the approval pipeline before I can merge it (hopefully today)

@buhtz If you could find the time to review my PR, I have changed only a few LoCs...

@buhtz
Copy link
Member

buhtz commented Aug 17, 2023

@buhtz If you could find the time to review my PR, I have changed only a few LoCs...

Did. Didn't you get notification about it? I might have missed a button somewhere in the GitHub frontend.

@aryoda
Copy link
Contributor

aryoda commented Aug 17, 2023

Did. Didn't you get notification about it? I might have missed a button somewhere in the GitHub frontend.

#1503 (it is a new one, different from the fix here)?

@buhtz
Copy link
Member

buhtz commented Aug 17, 2023

#1503 (it is a new one, different from the fix here)?
Missed a button.

@jean-christophe-manciot
Copy link
Author

jean-christophe-manciot commented Aug 17, 2023

With the latest build (ecf9312) and the latest @aryoda diagnostics user callback script, I started a backup of profile 2 which failed with the error:

[E] Error: rsync: [generator] link "/<path>/new_snapshot/backup/<path>" => ../../20230817-190021-377/backup/<path> failed: Invalid cross-device link (18)

The user-callback.log contains:

2023-08-17 21:41:16 (root/2809530) profile_id=1: 5 - backintime-qt GUI started
2023-08-17 21:41:16 (root/2809580) profile_id=1: 7 - mount drive requested
2023-08-17 21:43:11 (root/2852961) profile_id=1: 8 - unmount drive requested
2023-08-17 21:43:14 (root/2856642) profile_id=2: 7 - mount drive requested
2023-08-17 21:43:14 (root/2856650) profile_id=2: 1 - backup process begins
2023-08-17 21:43:55 (root/2862664) profile_id=2: 4 - ERROR : Unknown error code!
2023-08-17 21:43:55 (root/2862664) profile_id=2:     Error message: Failed to take snapshot 2023-08-17 21:43:14.
2023-08-17 21:43:59 (root/2863185) profile_id=2: 2 - backup process ends
2023-08-17 21:43:59 (root/2863225) profile_id=2: 8 - unmount drive requested
2023-08-17 21:44:16 (root/2880815) profile_id=2: 8 - unmount drive requested
2023-08-17 21:44:16 (root/2880825) profile_id=2: 6 - backintime-qt GUI closed

What is strange is:

  • Unknown error code
  • 2 x profile_id=2: 8 - unmount drive requested

@aryoda
Copy link
Contributor

aryoda commented Aug 17, 2023

THX a lot for testing!

Regarding the "Unknown error code!" msg it looks like the old version of the diagnostics user callback script is called.

Could you please open the profile settings dialog, click on "edit user-callback" and check if it contains the new error codes:

https://github.com/bit-team/user-callback/blob/a4609984b30ff00b3fbac7148cd0b5476ba38c9d/user-callback.diagnostics#L65-L73

Regarding the duplicated "8 - unmount..." it looks like the second unmount was triggered by closing the GUI (same timestamp as the next log entry).

The GUI must also mount or unmount if you change the profile (not very nice and efficient, we have open issues here, e.g. #977, #724).

I will try to reproduce and debug your observations tomorrow night...

@jean-christophe-manciot
Copy link
Author

@aryoda

I failed to mention an important part: I did receive an email for the first time and the script has been called with the following parameters:

user-callback 2 <profile_name> 4 5 Failed to take snapshot 2023-08-17 21:43:14

I looks like the issue has been solved: bravo :-)

Regarding the "Unknown error code!" msg it looks like the old version of the diagnostics user callback script is called.

My bad: I did take the contents of your new user-callback but I kept my old $4 parameter name ```error_code````, hence the "Unknown error code".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants