Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better error handling #133

Merged
merged 11 commits into from
Aug 11, 2023
Merged

Conversation

Ichmed
Copy link
Contributor

@Ichmed Ichmed commented Jul 26, 2023

Builds on top of #131

Replaces all unwrap(), panic!() and exit() calls in podman.rs with Errors that are propagated to back up to the main function.

  1. Makes function API cleaner and more consistent
  2. Clearly marks which functions can error
  3. Result is a 'must_use' type, so errors can no longer be missed on accident
  4. Enables cleanup of an error state in the future by not terminating immediately

Since all errors simply propagate back up to main() the external behavior should be completely unchanged except for some hard coded error messages (could be added back in if needed).

Sorry for the big diff, but this time it's not possible to make it smaller since almost all return types were changed. Diff could be reduced in size by only adding Result to some methods, but this would only needlessly complicate the process in my eyes.

@Ichmed Ichmed marked this pull request as ready for review August 7, 2023 14:17
@Ichmed Ichmed requested a review from schaefi August 7, 2023 14:18
@schaefi
Copy link
Contributor

schaefi commented Aug 7, 2023

Thanks much, a very valuable change. I understand that it cannot be done smaller. Still not so easy to review though. Actually I was looking only on behavior change because I knew in some cases errors were ignored because, either ok to continue or required to keep the subsequent cleanup code running.

It's hard to check if the changes does not introduce a behavior change in this regard. I guess we only find out by testing. It looks good to me but can we go to the following sequence of tests:

  • flake-ctl podman register --container A --base base_for_A, with A wrong or base_for_A wrong, run it and check if podman image mount and podman mount doesn't list any open mounts

  • flake-ctl podman register --container A --bas base_for_A --layer B, with wrong layer B, run it and again check if mounts are open

I think there is more but I know in case of errors on the above a cleanup sequence has to run

I will do testing too... but need time and I'm also on vacation next week. I'll try to check on it the next days but might be blocked due to other work

Copy link
Contributor

@schaefi schaefi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Michael, I did some tests with this changes and run into a bit of trouble. So here is what I did:

  1. Test current code (error handling change not applied)
flake-ctl podman register --container ubdevtools --app /home/ms/ubuntu --target /usr/bin/bash
ms@asterix:/home/ms
> ./ubuntu 
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }', podman-pilot/src/podman.rs:881:59
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

expected error because:

pub const CONTAINER_CID_DIR: &str = "/var/lib/containers/storage/tmp/flakes";

and 

ms@asterix:/home/ms
> ls -l /var/lib/containers/storage/tmp/flakes
ls: cannot access '/var/lib/containers/storage/tmp/flakes': Permission denied

running sudo ./ubuntu works, ok
  1. Test your branch (new error handling)
ms@asterix:/home/ms
> ./ubuntu 
✓ Launching flake            
^C

It let me run the flake when it should error, continues and hangs somewhere along the line

Can you double check on the way you handle the return of gc()

Thanks

@Ichmed
Copy link
Contributor Author

Ichmed commented Aug 8, 2023

I changed the behaviour of gc, it should now work the same as before.

But "not crashing on gc" being the problem seems to hint at a deeper problem, no? Because the pilot should work independently of the garbage collection, if it can't access the directory it should crash at a later point anyway.

@Ichmed
Copy link
Contributor Author

Ichmed commented Aug 8, 2023

The failing Github Action is the result of an update to https://crates.io/crates/strip-ansi-escapes on 08.08.2023 that changed the return type of strip from Result<Vec<u8>> to Vec<u8>

@schaefi
Copy link
Contributor

schaefi commented Aug 9, 2023

I changed the behaviour of gc, it should now work the same as before.

But "not crashing on gc" being the problem seems to hint at a deeper problem, no? Because the pilot should work independently of the garbage collection, if it can't access the directory it should crash at a later point anyway.

I agree with you a refactoring of that part makes sense just not in this pull request please :)

@schaefi
Copy link
Contributor

schaefi commented Aug 9, 2023

The failing Github Action is the result of an update to https://crates.io/crates/strip-ansi-escapes on 08.08.2023 that changed the return type of strip from Result<Vec<u8>> to Vec<u8>

Hmm, this killed the bill of materials creation. I wondered why apt install cargo leads to a recompilation of cargo. As the recompile fetches from the crates index and then fails to compile due to the change you pointed out, it sounds relatively severe doesn't it ?

Other than disable the sbom action I don't see any fix that we could provide. What's your take on that ?

changed the return type of strip from Result<Vec<u8>> to Vec<u8>

shouldn't this sort of changes be forbidden ? I mean that kills many code or am I mistaken ?

@schaefi
Copy link
Contributor

schaefi commented Aug 9, 2023

ok, I repeated my test with the changes you did. So if I now run the as normal user I get

ms@asterix:/home/ms
> ./ubuntu 
[2023-08-09T08:54:33Z ERROR podman_pilot] Permission denied (os error 13)

This is ok but to be honest a weak error message. Prior the change this was a panic call and it included more details that was also controllable via RUST_BACKTRACE. So we get this

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }', podman-pilot/src/podman.rs:881:59

So this is also not a nice error message but imho better as what we have now ;) I think if we change the mechanics here the error message itself needs to be more clear

@schaefi
Copy link
Contributor

schaefi commented Aug 9, 2023

This was the first finding. The next finding is more severe because the app now doesn't start if the permissions were available. So when I call

ms@asterix:/home/ms
> sudo ./ubuntu 
✓ Launching flake            

==> here it hangs

I had to hit Ctrl-C to get out of the hanging state

@schaefi
Copy link
Contributor

schaefi commented Aug 9, 2023

@Ichmed when you are coding these changes do you actually test the tool ? In case you have issues testing let me know. Thanks

@Ichmed
Copy link
Contributor Author

Ichmed commented Aug 9, 2023

Usually yes, but i did not have time yesterday and i still need some help with the more complex setups.

As for the breaking change, yes that's a semver violation afaik since it changes the public API.

The error messages can be as detailed as we want of course, but maybe we should move them to a separate PR since the big ugly change here is turning everything into a Result, changing the error types or even just messages later is a much smaller change.

@schaefi what images are you using to test with? I'm also currently working on a test suit that could use some simple images that are publicly available

@schaefi
Copy link
Contributor

schaefi commented Aug 9, 2023

The error messages can be as detailed as we want of course, but maybe we should move them to a separate PR since the big ugly change here is turning everything into a Result, changing the error types or even just messages later is a much smaller change.

ok, I can live with that

@schaefi what images are you using to test with? I'm also currently working on a test suit that could use some simple images that are publicly available

I either use the information from the quickstart here: https://github.com/Elektrobit/flake-pilot#oci
or for testing of deltas and layers I have setup a project with images and containers publicly. So you can do the following as an example:

sudo flake-ctl podman pull --uri registry.opensuse.org/home/marcus.schaefer/delta_containers/containers_suse/joe

flake-ctl podman register \
    --container joe \
    --target /usr/bin/joe \
    --app $HOME/joe \
    --base registry.opensuse.org/home/marcus.schaefer/delta_containers/containers_suse/basesystem

sudo $HOME/joe

This will auto fetch the base container, setup the delta through the pilot and runs the joe editor

It's a good test for quite some functions of the pilot and everyone can do it because it's all public

Hope this helps

Thanks

@schaefi
Copy link
Contributor

schaefi commented Aug 9, 2023

As for the breaking change, yes that's a sever violation afaik since it changes the public API.

ok so we are on the same page :) What should we do ? I suggest a commit here that disabled the SBOM action. We can enable it again when things starts to work upstream. I cannot imagine that they won't fix it ;)

@Ichmed I also noticed you are still working from a fork. Do you have a particular reason to do so ? You should have full access permissions to work from branches of the upstream git. I think this would be a bit easier also for me to test changes. But I of course leave this up to you

@Ichmed
Copy link
Contributor Author

Ichmed commented Aug 9, 2023

Yes we should just ignore the action for now. To be honest the crate version should be yanked, i will see if i can contact the maintainer or raise an issue if there isn't already. EDIT: They are already on it

No particular need for the fork, I'm just used to doing it that way so the main repo isn't full of half baked feature branches but i still have my code somewhere else than local.

@Ichmed
Copy link
Contributor Author

Ichmed commented Aug 9, 2023

I found the error, using Command::output() instead of Command::status() blocks because the program doesnt not send EOF, replacing perform() with status() fixes this but i will create a more sophisticated solution tomorrow

@Ichmed Ichmed requested a review from schaefi August 9, 2023 22:29
Copy link
Contributor

@schaefi schaefi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I did several tests and they all worked for me. Very nice implementation going forward with better error handling. Thanks for your work 👍

@schaefi schaefi merged commit 53dc3cf into Elektrobit:master Aug 11, 2023
@Ichmed Ichmed deleted the better_error_handling branch August 11, 2023 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants