GitHub Actions Cache backend easily hits rate limit errors #20133
Brainstorming some potential ways to improve this:
This adds a bunch of additional metrics to provide more insight into the byte store behaviour: for each of the "read", "write" and "list missing digests" (aka "exists") operations, track:

- number of attempts
- number of successes (for "read", this is further split into "cached" and "uncached", i.e. whether the successful network request found data or not)
- number of errors

This mimics the metrics for the remote action cache. The minor shuffling of code in this PR also fixes a bug with the `RemoteStoreBlobBytesDownloaded` metric: before this PR, any successful network request would count as bytes downloaded, even if the digest didn't exist/wasn't downloaded (i.e. it successfully determined that the digest wasn't cached). This PR restricts the metric to only count digest bytes when the network request is successful _and_ the digest actually existed/was downloaded. I'm hopeful this will at least give more insight into the scale of the problem in #20133, in addition to being generally useful as "what's pants up to" reporting.
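To make the accounting fix concrete, here is an illustrative sketch in Python (the real implementation is in Pants' Rust code; the function and metric names here are hypothetical, chosen to mirror the description above): bytes are only counted as downloaded when the request succeeded *and* the digest was actually present.

```python
# Illustrative sketch only: hypothetical names mirroring the metrics
# described above, not the actual Rust implementation in Pants.

def record_read(metrics: dict, digest_size: int, request_ok: bool, digest_found: bool) -> None:
    """Update byte-store read metrics for a single network request."""
    metrics["read_attempts"] = metrics.get("read_attempts", 0) + 1
    if not request_ok:
        metrics["read_errors"] = metrics.get("read_errors", 0) + 1
        return
    if digest_found:
        metrics["read_successes_cached"] = metrics.get("read_successes_cached", 0) + 1
        # The fix: this only runs on a genuine hit. Previously, any successful
        # request counted its bytes, even when the digest wasn't cached.
        metrics["blob_bytes_downloaded"] = metrics.get("blob_bytes_downloaded", 0) + digest_size
    else:
        metrics["read_successes_uncached"] = metrics.get("read_successes_uncached", 0) + 1

m = {}
record_read(m, 1024, request_ok=True, digest_found=True)    # hit: bytes counted
record_read(m, 2048, request_ok=True, digest_found=False)   # miss: no bytes counted
record_read(m, 512, request_ok=False, digest_found=False)   # error
```

With these three calls, `blob_bytes_downloaded` ends up at 1024: the miss's 2048 bytes are no longer counted.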
I've asked support and they said:
This takes the unusual step of _removing_ documentation. In particular, this removes the documentation for the still _very_ experimental new remote cache backends, GitHub Actions and file system, because they're useless and very unstable:

- there's significant user-facing refactoring work that I'm not going to get to for 2.19 (#19902), particularly around changing how the options are configured
- the GHA backend seems to be very... limited, and only works for small repos (#20133)

I'm targeting 2.19, as it stabilises, and leaving the docs in `main` (for 2.20).
The #20138 metrics look like this in https://github.com/huonw/pants-cache-github-actions-poc/tree/huge. In particular, there's a lot of
… (Cherry-pick of #20690) (#20692) I think #20133 is basically a blocker for serious useful usage of the `remote_provider = "experimental-github-actions-cache"` backend, so it's worth calling out specifically, rather than forcing users to discover the errors and track down the issue themselves. Co-authored-by: Huon Wilson <[email protected]>
Any updates?
No change that I know of.
Describe the bug
The new experimental remote cache backend for GitHub Actions (#19831) seems to easily hit rate limits in practice, presumably because it's designed for coarse-grained caching of larger artefacts rather than the very fine-grained caching of individual items that Pants uses it for.
For instance (full messages below):
```
[WARN] Failed to read from remote cache (1 occurrences so far): failed to read ...: RateLimited (persistent) at read, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
[WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to ...: RateLimited (persistent) at write, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
```
Click to expand for example full error messages
read:
```
09:55:54.25 [WARN] Failed to read from remote cache (1 occurrences so far): failed to read huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a: RateLimited (persistent) at read, context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/cache?keys=huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a&version=pants-1, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:55:53 GMT", "server": "Kestrel", "retry-after": "60", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "176cbe80-c272-43ec-99d8-58b95bda99a0", "activityid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-tfs-session": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-e2eid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "1623", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, service: ghac, path: huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a, range: 0- } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
```
write:
```
09:56:01.76 [WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b: RateLimited (persistent) at write, context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/caches, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:56:01 GMT", "server": "Kestrel", "retry-after": "286", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "1fb111d5-08ce-442f-ad9a-1c9c1c64f68f", "activityid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-tfs-session": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-e2eid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "734", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, called: Backend::ghac_reserve, service: ghac, path: huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
```
(NB. inline code formatting to get better line breaking)
This means it seems to work well for small examples, but falls over as they become larger. Thus, mitigations are likely required.
For instance, https://github.com/huonw/pants-cache-github-actions-poc/tree/cf5c53b368f0b5229f103465f25005b6db465988 (on the `huge` branch) sets up 100 codegen targets that generate 100 files each, plus a corresponding test for each, so that `pants test ::` will force the codegen targets to do something. This means 200 process executions and 10k files need to be uploaded to or downloaded from the GHA cache in quick succession. A build that has a (partially) seeded remote cache quickly hits errors: https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/6718179818/job/18257496697
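Back-of-the-envelope arithmetic for why this repro blows past the observed limits, assuming roughly one GHA cache request per byte-store entry and per action-cache entry (an assumption for illustration; the real request pattern may differ, e.g. with retries or cache-key lookups):

```python
# Rough request-count estimate for the `huge` branch repro described above.
codegen_targets = 100
files_per_target = 100
process_executions = 200  # action-cache entries

byte_store_entries = codegen_targets * files_per_target   # 10_000 files
total_requests = byte_store_entries + process_executions  # 10_200 requests

observed_limit = 1623  # highest x-ratelimit-limit value seen in the errors above
ratio = total_requests / observed_limit                   # roughly 6x over the limit
```

So even under this optimistic one-request-per-entry assumption, a single warm-cache run needs several times more requests than the largest observed rate-limit window allows.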
Setting `remote_store_rpc_concurrency = 1` (down from the default of 128) didn't stop this occurring in my work repo, i.e. maxed-out sequential interactions seemed to be enough to hit the rate limits too. (It did appear to help with the reduced example above.)

Picking out headers from those messages that might seem relevant:
- the `x-ratelimit-limit` header has values 1623 and 734; I've seen it as high as ~3000 in my work repo
- the `x-ratelimit-remaining` header unsurprisingly always has value 0
- the `x-ratelimit-reset` (unix timestamp?) and `retry-after` (seconds to wait?) headers don't exactly align, but they're both order of minutes to wait, e.g. for the first one, the request ran at `2023-11-01T09:55:53+00:00` and the rate limiting says:
  - `x-ratelimit-reset`: `2023-11-01T09:58:37+00:00` (`datetime.fromtimestamp(1698832717).astimezone(timezone.utc).isoformat()`)
  - `date` + `retry-after`: `2023-11-01T09:56:53+00:00` (`(datetime(2023, 11, 1, 9, 55, 53, tzinfo=timezone.utc) + timedelta(seconds=60)).isoformat()`)

Info about rate limits and other similar tools' experience:
- Using gha cache with `mode=max` fails with 400 error docker/buildx#841 (comment) and onwards (particularly the summary at the end, docker/buildx#841 (comment)): in May 2022, people who are likely Microsoft employees suggest that the "Get API" has a rate limit of "900" (per second?), while the "reserve API" has one of "700", and suggest one could use the "get" API to reduce the reserve-spam.

Pants version
2.19.0.dev4
OS
GHA ubuntu runner
Additional info
N/A