insufficient retrying around transient errors during uv sync
(os error 104 in this case)
#8144
Comments
cc @konstin
I believe we already have this.
If we did retry, we would show "Request failed after 3 retries" in the error chain, so this error was not caught by our retrying and was not retried as it should have been. Currently, we only show individual retries with eventual success; see uv/crates/uv-client/src/base_client.rs, lines 402 to 413 at ad638d7.
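For illustration, here is a minimal, hypothetical sketch of that shape, using plain reqwest/tokio/anyhow rather than uv's actual middleware: individual transient failures are only visible as log lines, and "Request failed after N retries" only enters the error chain once retries are exhausted.

```rust
use std::time::Duration;

/// Hypothetical retry wrapper, not uv's implementation: logs each transient
/// attempt and, when retries are exhausted, wraps the last error so that
/// "Request failed after N retries" appears in the error chain.
async fn get_with_retries(
    client: &reqwest::Client,
    url: &str,
    max_retries: u32,
) -> anyhow::Result<reqwest::Response> {
    let mut attempt = 0;
    loop {
        match client.get(url).send().await {
            Ok(response) => return Ok(response),
            Err(err) if attempt < max_retries && is_transient(&err) => {
                attempt += 1;
                // Visible only in the log, not in the final error message.
                eprintln!("Transient failure for {url} (attempt {attempt}): {err}");
                tokio::time::sleep(Duration::from_millis(250 * 2u64.pow(attempt))).await;
            }
            Err(err) => {
                return Err(anyhow::Error::new(err)
                    .context(format!("Request failed after {attempt} retries")));
            }
        }
    }
}

/// Connection resets such as os error 104 typically surface as connect,
/// timeout, or body errors.
fn is_transient(err: &reqwest::Error) -> bool {
    err.is_connect() || err.is_timeout() || err.is_body()
}
```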
I'm confused, I don't see "after 3 retries" in the error message?
There was a "not" critically missing in that sentence.
I see this error occasionally and need to rerun uv. If it is retrying, it would be a better user experience to include how many times it tried in the final error message; this would allow the user to be more empathetic to uv when it happens ("hey, at least it tried!").
Oh, with the context of the missing "not": as I reported in #3514 (comment), I also occasionally get this error, and there is no mention of a retry.
I agree that there should be messaging about retries, but IIRC the problem here specifically is that that code path isn't in our retry logic at all (streaming or range requests, at least that's what it looks like from the logs; this needs a proper investigation, of course), so what's missing is hooking this into our retry logic, which will then also print proper feedback about retries.
Sorry to harp on about this, but I've now migrated my work's dev environment tooling to uv. I took a look at migrating our build tooling to uv yesterday and faced this error several times. Our build tooling has a lower tolerance for errors because it's more difficult to recover from, so I've put migrating on hold for now. I'm not in a rush, and I can revisit this in the future, but I imagine it will cause other users not to migrate to uv, and they won't make a GitHub issue about it. Let me know if there's anything useful I can run to help. I can't reproduce it consistently, but I'm happy to try a few times a week.
I'm not clear where #8144 comes from, so I'm adding some more error context to narrow it down.
I'm not clear where the error originates, so I'm trying to get better traces with #8285.
I see that PR landed, so I ran my build setup with it. FYI, I suspect the network issue on our end is that the firewall decides it needs to do a security scan of the package and kills the network connection; after the security scan is complete, the same requests work for at least a few more days, but at some point it falls out of the firewall cache (most of our builds are cached, so we don't normally make this network request). Compared to uv, it appears pip's timeout/retry mechanism will wait long enough for the security scan to complete. Find below as much of the end of the log as I could fit in this post; if you need me to attach the entire log, I will need to make a request to access Gist (it is blocked by default).
Due to the enduring problems with #8144 and related issues, I've opted for a more systemic approach and switched all reqwest errors to not use the implicit-try-fallthrough `#[from]`, but an explicit `#[source]` with a URL attached (except for when there is only one or no URL). This guarantees that we get the URL that failed, and helps identifying the responsible code path.
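Roughly, with thiserror (hypothetical variant names, not uv's actual error enum), the change looks like moving from the first variant below to the second:

```rust
use thiserror::Error;
use url::Url;

#[derive(Debug, Error)]
enum ClientError {
    /// Before: `#[from]` lets `?` convert automatically, but the error says
    /// nothing about which URL or code path failed.
    #[error(transparent)]
    Reqwest(#[from] reqwest::Error),

    /// After: the caller attaches the URL explicitly, so both the message and
    /// the error chain identify the failing request.
    #[error("Failed to fetch: `{url}`")]
    Fetch {
        url: Url,
        #[source]
        source: reqwest::Error,
    },
}

// Construction of the explicit variant happens via map_err rather than `?`:
//     .map_err(|err| ClientError::Fetch { url: url.clone(), source: err })
```

The cost is a `map_err` at every call site instead of a bare `?`, which is exactly what forces the URL (and therefore the responsible code path) into the error.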
I am following this with keen eyes. Thanks for the efforts here @konstin and also @notatallshaw for providing quick feedback.
music to my ears 🎵
Another one from within GitHub Actions CI:
This was … Instead of …
This week my company made a change to their network security stack, meaning I am unlikely to be able to reproduce this error again. So apologies: I see there was an additional PR that landed that gives more info, but I am not likely to be able to trigger it.
@konstin @charliermarsh I'd love to help here; retrying around transient errors is a passion of mine, but I have no Rust experience. Meh. What do we think about prioritizing this relative to other things? We keep seeing this in CI every now and then. Of course, an outer retrying watchdog thingy can always do the trick, but now that … Thanks again for all the great work you're doing.
I spoke too soon; here is an error I got today while testing migrating our build system to use uv (fortunately, I am leaving trace logs on until I think we are ready):

Trace output log from a …
Thanks for the verbose logs. We must still be missing a site or two.
(I'll try to debug today.)
Oh interesting -- that failure is actually in fetching the metadata, not the wheel itself.
I think the issue @notatallshaw-gts is experiencing is that we aren't retrying on body errors when fetching metadata.
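For illustration, a hypothetical sketch (made-up names, not uv's code) of why body errors slip past request-level retries: the request itself succeeds, so any retry logic wrapped around `send()` never sees the connection reset that happens while the body is being read.

```rust
/// Hypothetical helper: retries the whole fetch, including the body read,
/// because a reset (os error 104) mid-stream surfaces as a *body* error after
/// send() has already succeeded.
async fn fetch_metadata_text(
    client: &reqwest::Client,
    url: &str,
    max_retries: u32,
) -> Result<String, reqwest::Error> {
    let mut attempt = 0;
    loop {
        let result = async {
            let response = client.get(url).send().await?.error_for_status()?;
            // A send()-only retry layer has already declared success by the
            // time this read can fail.
            response.text().await
        }
        .await;

        match result {
            Ok(text) => return Ok(text),
            Err(err)
                if attempt < max_retries
                    && (err.is_body() || err.is_connect() || err.is_timeout()) =>
            {
                attempt += 1;
                eprintln!("Retrying metadata fetch for {url} (attempt {attempt}): {err}");
            }
            Err(err) => return Err(err),
        }
    }
}
```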
Yes, I think I mentioned it appeared to be the metadata before (#3514 (comment)); of course, I have no idea if others have the same issue, but I'm seeing it on a very popular corporate firewall (Palo Alto).
Yeah, I have a reproduction; we just don't seem to be retrying those, but it looks like we should be, so I'm still trying to understand why.
Thank you @charliermarsh!
FYI, I still got an error on 0.5.3, but I was running docker build without …
With uv 0.4.20, while running `uv sync --locked`, I saw this:

Environment: GitHub Actions
The log does not reveal whether any retry was attempted. Such transient errors should certainly be retried a small-ish number of times, up to a reasonable deadline.
Since I have thought and talked a lot about the relevance of retries around transient errors in distributed systems and CI: once we improve the retrying logic, it would be great to have explicit logs indicating those errors and the retries leading up to success.
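For what it's worth, the "small-ish number of times, up to a reasonable deadline" part can be expressed as a backoff schedule that refuses to sleep past an overall deadline; a sketch with made-up numbers (not uv's defaults):

```rust
use std::time::{Duration, Instant};

/// Returns the next backoff delay, or None once the overall deadline would be
/// exceeded (at which point the caller should give up and report how many
/// retries were attempted).
fn next_delay(attempt: u32, started: Instant, deadline: Duration) -> Option<Duration> {
    let base = Duration::from_millis(250);
    // 250ms, 500ms, 1s, 2s, ... capped at 10s per attempt.
    let delay = (base * 2u32.pow(attempt.min(6))).min(Duration::from_secs(10));
    if started.elapsed() + delay > deadline {
        None
    } else {
        Some(delay)
    }
}
```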