Using gha cache with mode=max fails with 400 error #841
Maybe @tonistiigi & @crazy-max have any hints or suggestions? I'd be happy to provide more info if required.
@hertzg Could you run a custom version of buildkit that logs out what request (parameters) causes this issue in https://github.com/moby/buildkit/blob/324fc9333bb63a2bf95f341a8d92639f9aa8348c/vendor/github.com/tonistiigi/go-actions-cache/cache.go#L440 @dhadka @chrispat Any ideas what this could be? It seems to come from https://github.com/tonistiigi/go-actions-cache/blob/4d48f2ff622acbb68ad134425c8df3096fec0229/cache.go#L272
@tonistiigi I could, yes, though I'm not sure I know exactly how. I assume I need a built image with
@tonistiigi Gentle bump 😊
@hertzg I pushed an image with additional logs.
Thanks for the instructions. I've set up a branch with those changes (incl. actions debug); the build is running and might take an hour to finish. I'll report back with the results.
First attempt: failed with timeouts. I've restarted it to see if it's flakiness; it seems to have passed on the 4th retry.
@tonistiigi I managed to reproduce it again. After manually adding the
Looks like maybe it's some kind of rate-limiting issue? 🤔 /cc @dhadka @chrispat
@kmkumaran can you have someone take a look here?
Is the retry logic safe? From what I can tell, it seems like the request is getting interrupted before getting to the cache API (we're seeing a Client Disconnected error). I'm not a Go developer, but I do see it's creating a reader for the JSON body (link to code):
Does it need to reset that Reader to the start of the content, or make sure it's not closed? My hunch is that it's not sending the correct JSON body on the retry, and the request is subsequently getting interrupted. This is also probably why the 400 error says
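To make @dhadka's point concrete, below is a minimal sketch (not the actual go-actions-cache patch) of the retry pattern being discussed, assuming a plain net/http client: the JSON payload is kept as a byte slice and a fresh reader is built for every attempt, so a retry never reuses a reader that was already drained or closed. The function and parameter names are illustrative only.

```go
package cacheretry

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithRetry posts a JSON payload and retries on transport errors or 5xx
// responses. The payload is stored as a byte slice and bytes.NewReader is
// called inside the loop, so every attempt sends the full body from offset 0
// instead of reusing a reader consumed by the previous attempt.
func postWithRetry(client *http.Client, url string, payload []byte, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", "application/json")

		resp, err := client.Do(req)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // caller inspects 2xx/4xx statuses itself
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("server returned %s", resp.Status)
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff between attempts
	}
	return nil, lastErr
}
```

net/http has a related mechanism in Request.GetBody, which lets the transport replay a body when a request has to be resent; rebuilding the reader explicitly, as above, keeps the retry path easy to audit.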
@dhadka That could be the issue indeed. Thanks for the pointer. I'll update that logic.
@tonistiigi Any builds you'd like me to try out?
test build
I am no longer able to reproduce the issue with
@tonistiigi I see that the PR has been merged and approved. Unfortunately, I'm unaware of the release cycles of buildkit; could you point me in the right direction as to when to expect the next release that would include this change? 🙌
@hertzg We will probably not do another patch release just for this unless something else comes up as well. You can use the master branch image (or pin it to a digest for safety). cc @crazy-max
Fixed in moby/buildkit#2506
Thanks for all the great work, everyone involved 🙌 For general reference, this is the digest I'm going for
Since I've started using the pinned digest with the changes, I've noticed a lot of timeout errors, with messages similar to:
Is this an issue on the cache side, or ...? Links:
@hertzg Yes, this is on the cache side. I'm seeing requests getting rate limited. There's a limit on the number of caches that can be created in every 5-minute window, and this action can exceed that if the number of layers and/or matrix builds is large. We're looking to increase that limit as long as it doesn't impact the health of the cache service (which I don't expect it to, but we just want to be careful 😄)
At the moment the cache library will retry with a backoff, but if the service does not respond within 5 minutes then you get an error like this.
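As a simplified illustration of that behaviour (not the real buildkit/go-actions-cache code), the sketch below retries rate-limited requests with exponential backoff and gives up once the surrounding context deadline expires; the ~5 minute deadline is an assumption taken from the comment above, and the helper name and callback are invented.

```go
package cachebackoff

import (
	"context"
	"errors"
	"math/rand"
	"net/http"
	"time"
)

// doWithBackoff keeps retrying while the service answers 429, doubling the
// wait between attempts (with jitter), and stops when ctx expires.
func doWithBackoff(ctx context.Context, client *http.Client, newReq func() (*http.Request, error)) (*http.Response, error) {
	delay := time.Second
	for {
		req, err := newReq() // build a fresh request (and body) per attempt
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req.WithContext(ctx))
		if err == nil && resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil // success, or a non-rate-limit error the caller handles
		}
		if err == nil {
			resp.Body.Close()
		}
		jitter := time.Duration(rand.Int63n(int64(delay) / 2))
		select {
		case <-ctx.Done():
			// The overall deadline (e.g. ~5 minutes) ran out while still rate limited.
			return nil, errors.New("cache request still failing when the deadline expired")
		case <-time.After(delay + jitter):
		}
		if delay < time.Minute {
			delay *= 2
		}
	}
}
```

A caller would typically create the context with something like context.WithTimeout(parentCtx, 5*time.Minute) before passing it in.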
@dhadka Thanks for the detailed explanation. What would be the recommended solution here, other than reducing the number of jobs? @tonistiigi As I understand it now, the whole build is getting marked as failed because the rate-limiting window and the total timeout are very close to each other (both 5 min). Would increasing the timeout help, so the build doesn't fail?
As things stand, I had to disable it, as it just keeps failing the builds and always needs restarts 😞
@hertzg we will be looking into relaxing the rate limit. But we need to carefully evaluate the extra load it will bring on the system. I will update you with an ETA.
Hello again from the new year, (hopefully) after all the festivities :)
@hertzg Apologies for the delay, but the good news is that we are rolling out a change to increase the rate-limit threshold. This should allow the buildx action to cache layers with a higher level of parallelization. cc @t-dedah to keep you in the loop when the rollout is complete. Would love to hear back on whether this helps in reducing the
@hertzg We have increased the RU limit from 450 to 700 on all the rings. This should reduce the number of 429s significantly. We saw a significant reduction in 429s after increasing the RU limit, as shown in the before/after images.
@t-dedah I'm seeing timeouts and builds still failing right after enabling https://github.com/hertzg/rtl_433_docker/runs/5523758504?check_suite_focus=true |
Hi @hertzg There is still a limit on the number of requests we can make to seal and reserve the cache. We have almost doubled that limit, but if any action breaches it, it will still get a 429.
@t-dedah @dhadka I'll try to give some more background on where I think our requests come from. Let's say the user is doing a build that touches, for example, 50 blobs/layers in total (with the GitHub cache, users usually export all intermediate layers as well, not only the final result layers, so this can grow fast). If that build is mostly cached, for example only 1 layer was updated, we will create a new "manifest" blob with links to these 50 layers. We will push the manifest and the new layer, but most importantly we still need to make a request for all the old 49 blobs just to check that they still exist. We make a request for each of them, GitHub answers that the record exists, and we can continue.

This needs to happen for all builds, and some repositories run a lot of builds in a single workflow, or very complex (multi-platform) builds with lots of layers. Even if many builds mostly share the same layers, we need to check them all on each build.

If there were an endpoint we could use just to check that a cache key exists, one that is not rate-limited (or has a much higher limit), it would be much less likely to hit these limits. That endpoint does not need to provide a download link or reserve a key like the current requests do. It could also be a batch endpoint to check multiple keys together. Maybe even just an endpoint to list all current keys would be manageable, as keys should be small and not take much room even if there are a lot of them.

Another way would be for us to somehow remember that a cache key existed and not check it more often than some timeframe. But for that, we would need some kind of external storage/database where we could keep these records.

Looks like this is a private repository.
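To make the request volume concrete, here is a hypothetical sketch of what the "check that each old layer still exists" step amounts to. KeyChecker and Exists are invented names for illustration rather than the go-actions-cache API, but in the real client each such check is one request against the cache service, so a 50-layer build performs roughly 50 of them even when everything is already cached.

```go
package cachechecks

import "context"

// KeyChecker abstracts "does this cache key still exist?". Hypothetical
// interface; in practice this maps onto the GET cache endpoint discussed below.
type KeyChecker interface {
	Exists(ctx context.Context, key string) (bool, error)
}

// verifyLayers confirms every referenced layer key is still present and
// returns the ones that are missing. The number of API requests grows
// linearly with the number of layer keys.
func verifyLayers(ctx context.Context, c KeyChecker, layerKeys []string) ([]string, error) {
	var missing []string
	for _, key := range layerKeys {
		ok, err := c.Exists(ctx, key) // one cache-service request per key
		if err != nil {
			return nil, err
		}
		if !ok {
			missing = append(missing, key)
		}
	}
	return missing, nil
}
```

A batch or list endpoint, as suggested above, would collapse this loop into one or a handful of requests.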
If you're using the cache toolkit module, then reads and writes are handled in separate calls ( Inside
Although this is getting into some of the internal implementation. Perhaps we need to add another top-level function alongside This
I think this is the same as https://github.com/tonistiigi/go-actions-cache/blob/master/api.md#get-cache that we already use. We don't actually pull the cache in the case described above, but the request we make contains a download link. I'm not sure whether including this download link is the reason you need to rate-limit this request.
Hi @tonistiigi We already have a
I believe this is the same call I linked earlier, and I think this is the one that most likely gets the limit error, as these are the most common requests we send. The way to avoid these requests would be to add an endpoint that returns all existing cache keys for a scope, so we don't need to check them individually. Or add another

Additional question: are the limits per workflow step or more global? I assume they are for the whole workflow/repo/user. If they are not, we might be able to save some requests by caching the "exist" results locally in case there are repeated builds inside the same workflow step.
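A rough sketch of the local "remember that the key existed" idea from the comment above, assuming a simple in-memory TTL memo (all names are hypothetical). As noted, this only helps when repeated builds run inside the same builder/workflow step; surviving across builders would need external storage.

```go
package existscache

import (
	"sync"
	"time"
)

// seenCache remembers when a cache key was last confirmed to exist, so
// repeated builds in the same process can skip re-checking it within the TTL.
type seenCache struct {
	mu   sync.Mutex
	ttl  time.Duration
	seen map[string]time.Time
}

func newSeenCache(ttl time.Duration) *seenCache {
	return &seenCache{ttl: ttl, seen: make(map[string]time.Time)}
}

// RecentlyConfirmed reports whether the key was confirmed within the TTL,
// meaning another existence request to the cache API can be skipped.
func (c *seenCache) RecentlyConfirmed(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	t, ok := c.seen[key]
	return ok && time.Since(t) < c.ttl
}

// MarkConfirmed records that the key was just confirmed to exist.
func (c *seenCache) MarkConfirmed(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.seen[key] = time.Now()
}
```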
The logs shared above show that the 429 is due to the POST call.
The limit is for a particular user, which would mean it's linked to the repo rather than to a workflow run.
The addition of a new API might take some time, but we can try increasing the limit for the GET call. I'll get back to you on this one after discussing with my team.
Hi @tonistiigi Summary:
…oid timeouts May be related to docker/buildx#841
…eouts May be related to docker/buildx#841 and moby/buildkit#2804
Hello, I have recently started using `cache-to: "type=gha,mode=max,scope=..."` to cache all the layers, and the following error seemed to persist only for specific builds (consistently failing on the same ones). After removing the `mode=max` the issues went away, but obviously not everything is cached.
Failing builds with `mode=max`:
Passing build after removing `mode=max`: