Using gha cache with mode=max fails with 400 error #841
Maybe @tonistiigi & @crazy-max have any hints or suggestions? I'd be happy to provide more info if required.
@hertzg Could you run a custom version of buildkit that logs out what request (parameters) causes this issue in https://github.com/moby/buildkit/blob/324fc9333bb63a2bf95f341a8d92639f9aa8348c/vendor/github.com/tonistiigi/go-actions-cache/cache.go#L440 @dhadka @chrispat Any ideas what this could be? It seems to come from https://github.com/tonistiigi/go-actions-cache/blob/4d48f2ff622acbb68ad134425c8df3096fec0229/cache.go#L272
@tonistiigi I could, yes, though I'm not sure I know exactly how. I assume I need a built image with
@tonistiigi Gentle bump 😊
@hertzg I pushed an image with additional logs.
Thanks for the instructions. I've set up a branch with those changes (incl. actions debug); the build is running and might take an hour to finish. I'll report back with the results.
First attempt: failed with timeouts. I've restarted it to see if it's flakiness; it seems to have passed on the 4th retry.
@tonistiigi I managed to reproduce it again. After manually adding the
Looks like maybe it's some kind of rate-limiting issue? 🤔 /cc @dhadka @chrispat
@kmkumaran can you have someone take a look here?
Is the retry logic safe? From what I can tell, it seems like the request is getting interrupted before getting to the cache API (we're seeing a Client Disconnected error). I'm not a Go developer, but I do see it's creating a reader for the JSON body (link to code):
Does it need to reset that Reader to the start of the content, or make sure it's not closed? My hunch is that it's not sending the correct JSON body on the retry, and the request is subsequently getting interrupted. This is also probably why the 400 error says
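To make @dhadka's point concrete, below is a minimal sketch (not the actual go-actions-cache patch) of the retry pattern being discussed, assuming a plain net/http client: the JSON payload is kept as a byte slice and a fresh reader is built for every attempt, so a retry never reuses a reader that was already drained or closed. The function and parameter names are illustrative only.

```go
package cacheretry

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithRetry posts a JSON payload and retries on transport errors or 5xx
// responses. The payload is stored as a byte slice and bytes.NewReader is
// called inside the loop, so every attempt sends the full body from offset 0
// instead of reusing a reader consumed by the previous attempt.
func postWithRetry(client *http.Client, url string, payload []byte, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", "application/json")

		resp, err := client.Do(req)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // caller inspects 2xx/4xx statuses itself
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("server returned %s", resp.Status)
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff between attempts
	}
	return nil, lastErr
}
```

net/http has a related mechanism in Request.GetBody, which lets the transport replay a body when a request has to be resent; rebuilding the reader explicitly, as above, keeps the retry path easy to audit.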
@dhadka That could be the issue indeed. Thanks for the pointer. I'll update that logic.
@tonistiigi Any builds you'd like me to try out?
test build
I am no longer able to reproduce the issue with
@tonistiigi I see that the PR has been merged and approved. Unfortunately, I'm unaware of the release cycles of buildkit; could you point me in the right direction as to when to expect the next release that would include this change? 🙌
@hertzg We will probably not do another patch release just for this unless something else comes up as well. You can use the master branch image (or pin it to a digest for safety). cc @crazy-max
Fixed in moby/buildkit#2506
Thanks for all the great work, everyone involved 🙌 For general reference, this is the digest I'm going for
Since I've started using the pinned digest with the changes, I've noticed a lot of timeout errors, with messages similar to:
Is this an issue on the cache side, or ...? Links:
@hertzg Yes, this is on the cache side. I'm seeing requests getting rate limited. There's a limit on the number of caches that can be created in every 5-minute window, and this action can exceed that if the number of layers and/or matrix builds is large. We're looking to increase that limit as long as it doesn't impact the health of the cache service (which I don't expect it to, but we just want to be careful 😄)
At the moment the cache library will retry with a backoff, but if the service does not respond within 5 minutes then you get an error like this.
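As a simplified illustration of that behaviour (not the real buildkit/go-actions-cache code), the sketch below retries rate-limited requests with exponential backoff and gives up once the surrounding context deadline expires; the ~5 minute deadline is an assumption taken from the comment above, and the helper name and callback are invented.

```go
package cachebackoff

import (
	"context"
	"errors"
	"math/rand"
	"net/http"
	"time"
)

// doWithBackoff keeps retrying while the service answers 429, doubling the
// wait between attempts (with jitter), and stops when ctx expires.
func doWithBackoff(ctx context.Context, client *http.Client, newReq func() (*http.Request, error)) (*http.Response, error) {
	delay := time.Second
	for {
		req, err := newReq() // build a fresh request (and body) per attempt
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req.WithContext(ctx))
		if err == nil && resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil // success, or a non-rate-limit error the caller handles
		}
		if err == nil {
			resp.Body.Close()
		}
		jitter := time.Duration(rand.Int63n(int64(delay) / 2))
		select {
		case <-ctx.Done():
			// The overall deadline (e.g. ~5 minutes) ran out while still rate limited.
			return nil, errors.New("cache request still failing when the deadline expired")
		case <-time.After(delay + jitter):
		}
		if delay < time.Minute {
			delay *= 2
		}
	}
}
```

A caller would typically create the context with something like context.WithTimeout(parentCtx, 5*time.Minute) before passing it in.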
@dhadka Thanks for the detailed explanation. What would be the recommended solution here, other than reducing the number of jobs? @tonistiigi As I understand it now, the whole build is getting marked as failed because the rate-limiting window and the total timeout are very close to each other (both 5 min). Would increasing the timeout help, so the build doesn't fail?
As things stand, I had to disable it, as it just keeps failing the builds and always needs restarts 😞
@hertzg we will be looking into relaxing the rate limit. But we need to carefully evaluate the extra load it will bring on the system. I will update you with an ETA.
Hello again from the new year, (hopefully) after all the festivities :)
@hertzg Apologies for the delay, but the good news is that we are rolling out a change to increase the rate-limit threshold. This should allow the buildx action to cache layers with a higher level of parallelization. cc @t-dedah to keep you in the loop when the rollout is complete. Would love to hear back on whether this helps in reducing the
@hertzg We have increased the RU limit from 450 to 700 on all the rings. This should reduce the number of 429s significantly. We saw a significant reduction in 429s after increasing the RU limit, as shown in the before/after images.
@t-dedah I'm seeing timeouts and builds still failing right after enabling https://github.com/hertzg/rtl_433_docker/runs/5523758504?check_suite_focus=true |
Hi @hertzg There is still a limit on the number of requests we can make to seal and reserve the cache. We have almost doubled that limit, but if any action breaches it, it will still get a 429.
@t-dedah @dhadka I'll try to give some more background on where I think our requests come from. Let's say the user is doing a build that touches, for example, 50 blobs/layers in total (with the GitHub cache, users usually export all intermediate layers as well, not only the final result layers, so this can grow fast). If that build is mostly cached, for example only 1 layer was updated, we will create a new "manifest" blob with links to these 50 layers. We will push the manifest and the new layer, but most importantly we still need to make a request for all the old 49 blobs just to check that they still exist. We make a request for each of them, GitHub answers that the record exists, and we can continue.

This needs to happen for all builds, and some repositories run a lot of builds in a single workflow, or very complex (multi-platform) builds with lots of layers. Even if many builds mostly share the same layers, we need to check them all on each build.

If there were an endpoint we could use just to check that a cache key exists, one that is not rate-limited (or has a much higher limit), it would be much less likely to hit these limits. That endpoint does not need to provide a download link or reserve a key like the current requests do. It could also be a batch endpoint to check multiple keys together. Maybe even just an endpoint to list all current keys would be manageable, as keys should be small and not take much room even if there are a lot of them.

Another way would be for us to somehow remember that a cache key existed and not check it more often than some timeframe. But for that, we would need some kind of external storage/database where we could keep these records.

Looks like this is a private repository.
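To make the request volume concrete, here is a hypothetical sketch of what the "check that each old layer still exists" step amounts to. KeyChecker and Exists are invented names for illustration rather than the go-actions-cache API, but in the real client each such check is one request against the cache service, so a 50-layer build performs roughly 50 of them even when everything is already cached.

```go
package cachechecks

import "context"

// KeyChecker abstracts "does this cache key still exist?". Hypothetical
// interface; in practice this maps onto the GET cache endpoint discussed below.
type KeyChecker interface {
	Exists(ctx context.Context, key string) (bool, error)
}

// verifyLayers confirms every referenced layer key is still present and
// returns the ones that are missing. The number of API requests grows
// linearly with the number of layer keys.
func verifyLayers(ctx context.Context, c KeyChecker, layerKeys []string) ([]string, error) {
	var missing []string
	for _, key := range layerKeys {
		ok, err := c.Exists(ctx, key) // one cache-service request per key
		if err != nil {
			return nil, err
		}
		if !ok {
			missing = append(missing, key)
		}
	}
	return missing, nil
}
```

A batch or list endpoint, as suggested above, would collapse this loop into one or a handful of requests.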
If you're using the cache toolkit module, then reads and writes are handled in separate calls ( Inside
Although this is getting into some of the internal implementation. Perhaps we need to add another top-level function alongside This
I think this is the same as https://github.com/tonistiigi/go-actions-cache/blob/master/api.md#get-cache that we already use. We don't actually pull the cache in the case described above, but the request we make contains a download link. I'm not sure whether including this download link is the reason you need to rate-limit this request.
Hi @tonistiigi We already have a
I believe this is the same call I linked earlier, and I think this is the one that most likely gets the limit error, as these are the most common requests we send. The way to avoid these requests would be to add an endpoint that returns all existing cache keys for a scope, so we don't need to check them individually. Or add another

Additional question: are the limits per workflow step or more global? I assume they are for the whole workflow/repo/user. If they are not, we might be able to save some requests by caching the "exist" results locally in case there are repeated builds inside the same workflow step.
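A rough sketch of the local "remember that the key existed" idea from the comment above, assuming a simple in-memory TTL memo (all names are hypothetical). As noted, this only helps when repeated builds run inside the same builder/workflow step; surviving across builders would need external storage.

```go
package existscache

import (
	"sync"
	"time"
)

// seenCache remembers when a cache key was last confirmed to exist, so
// repeated builds in the same process can skip re-checking it within the TTL.
type seenCache struct {
	mu   sync.Mutex
	ttl  time.Duration
	seen map[string]time.Time
}

func newSeenCache(ttl time.Duration) *seenCache {
	return &seenCache{ttl: ttl, seen: make(map[string]time.Time)}
}

// RecentlyConfirmed reports whether the key was confirmed within the TTL,
// meaning another existence request to the cache API can be skipped.
func (c *seenCache) RecentlyConfirmed(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	t, ok := c.seen[key]
	return ok && time.Since(t) < c.ttl
}

// MarkConfirmed records that the key was just confirmed to exist.
func (c *seenCache) MarkConfirmed(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.seen[key] = time.Now()
}
```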
The logs shared above show that the 429 is due to the POST call.
The limit is for a particular user, which would mean it's linked to the repo rather than to a workflow run.
The addition of a new API might take some time, but we can try increasing the limit for the GET call. I'll get back to you on this one after discussing with my team.
Hi @tonistiigi Summary:
…oid timeouts May be related to docker/buildx#841
…eouts May be related to docker/buildx#841 and moby/buildkit#2804
Hello, I have recently started using `cache-to: "type=gha,mode=max,scope=..."` to cache all the layers, and the following error seemed to persist only for specific builds (consistently failing on the same ones). After removing the `mode=max` the issues went away, but obviously not everything is cached.
Failing builds with `mode=max`:
Passing build after removing `mode=max`: