GitHub Actions Cache backend easily hits rate limit errors #20133

Open

huonw opened this issue Nov 1, 2023 · 5 comments

huonw commented Nov 1, 2023

Describe the bug

The new experimental remote cache backend for GitHub Actions (#19831) seems to hit rate limits easily in practice, presumably because the GHA cache is designed for coarse-grained caching of larger artefacts rather than the very fine-grained caching of individual items that Pants uses it for.

For instance (full messages below):

[WARN] Failed to read from remote cache (1 occurrences so far): failed to read ...: RateLimited (persistent) at read, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

[WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to ...: RateLimited (persistent) at write, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

Example full error messages:

read:

09:55:54.25 [WARN] Failed to read from remote cache (1 occurrences so far): failed to read huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a: RateLimited (persistent) at read, context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/cache?keys=huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a&version=pants-1, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:55:53 GMT", "server": "Kestrel", "retry-after": "60", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "176cbe80-c272-43ec-99d8-58b95bda99a0", "activityid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-tfs-session": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-e2eid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "1623", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, service: ghac, path: huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a, range: 0- } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

write:

09:56:01.76 [WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b: RateLimited (persistent) at write, context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/caches, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:56:01 GMT", "server": "Kestrel", "retry-after": "286", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "1fb111d5-08ce-442f-ad9a-1c9c1c64f68f", "activityid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-tfs-session": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-e2eid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "734", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, called: Backend::ghac_reserve, service: ghac, path: huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

(NB. inline code formatting to get better line breaking)

This means the backend seems to work well for small examples, but falls over as they become larger, so mitigations are likely required.
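One generic client-side mitigation would be backing off for as long as the retry-after header asks before retrying. A minimal sketch of that idea in Python (the send callable and response shape are hypothetical stand-ins; this is not what Pants currently does):

  import time

  def with_rate_limit_backoff(send, max_attempts=5):
      # Call `send` (any zero-argument callable returning an object with
      # .status and .headers), retrying 429 responses after waiting for
      # the number of seconds the retry-after header requests.
      response = send()
      for _ in range(max_attempts - 1):
          if response.status != 429:
              return response
          time.sleep(int(response.headers.get("retry-after", "60")))
          response = send()
      return response

Given the observed retry-after values of 60-286 seconds, though, waiting them out would make larger builds extremely slow, so this alone isn't a fix.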

As a concrete reproduction, https://github.com/huonw/pants-cache-github-actions-poc/tree/cf5c53b368f0b5229f103465f25005b6db465988 (the huge branch) sets up 100 codegen targets that generate 100 files each, plus a corresponding test, so that pants test :: forces the codegen targets to do something. This means 200 process executions and 10k files need to be uploaded to or downloaded from the GHA cache in quick succession.

A build that has a (partially) seeded remote cache quickly hits errors: https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/6718179818/job/18257496697

Setting remote_store_rpc_concurrency = 1 (down from the default of 128) didn't stop this occurring in my work repo, i.e. even flat-out sequential interactions seemed to be enough to hit the rate limits too. (It did appear to help with the reduced example above.)

Picking out potentially relevant headers from those messages:

  • The x-ratelimit-limit header has values 1623 and 734; I've seen it as high as ~3000 in my work repo.
  • The x-ratelimit-remaining header, unsurprisingly, always has value 0.
  • The x-ratelimit-reset (a unix timestamp?) and retry-after (seconds to wait?) headers don't exactly align, but both say to wait on the order of minutes. For the first one, the request ran at 2023-11-01T09:55:53+00:00 and the rate limiting says (runnable versions of these conversions follow the list):
    • x-ratelimit-reset: 2023-11-01T09:58:37+00:00 (datetime.fromtimestamp(1698832717).astimezone(timezone.utc).isoformat())
    • date + retry-after: 2023-11-01T09:56:53+00:00 ((datetime(2023, 11, 1, 9, 55, 53, tzinfo=timezone.utc) + timedelta(seconds=60)).isoformat())
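Those two conversions as runnable Python, with the values from the first error's headers:

  from datetime import datetime, timedelta, timezone

  # x-ratelimit-reset appears to be a unix timestamp:
  reset = datetime.fromtimestamp(1698832717, tz=timezone.utc)
  print(reset.isoformat())  # 2023-11-01T09:58:37+00:00

  # retry-after appears to be seconds to wait, relative to the date header:
  date = datetime(2023, 11, 1, 9, 55, 53, tzinfo=timezone.utc)
  print((date + timedelta(seconds=60)).isoformat())  # 2023-11-01T09:56:53+00:00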

Info about rate limits and other similar tools' experiences:

Pants version
2.19.0.dev4

OS
GHA ubuntu runner

Additional info
N/A

huonw commented Nov 1, 2023

Brainstorming some potential ways to improve this:

  1. use the "GET API" somehow (no idea how, yet)
  2. reduce the number of requests by aggregating more blobs, particularly in the byte store:
    1. directories: upload whole Trees instead of directories recursively referencing more directories (Stu's idea in https://pantsbuild.slack.com/archives/C0D7TNJHL/p1684812802859609?thread_ts=1684809167.631159&cid=C0D7TNJHL)
    2. file blobs: batch blobs into a single upload somehow, e.g. all (small-enough) descendants of a single directory, keyed by some aggregate digest of the directory and their parent (NB. this has to be an obvious/reproducible algorithm, so that the corresponding digest can be computed when reading, with less information available than when writing); see the sketch after this list
  3. Magically increase GitHub's rate limits
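For option 2.2, a minimal sketch of what a reproducible aggregate key could look like (the function and its inputs are hypothetical, not an existing Pants API):

  import hashlib

  def batch_key(parent_digest: str, child_file_digests: list[str]) -> str:
      # Derive a deterministic key for a batched upload of a directory's
      # (small-enough) files. Sorting makes the key independent of
      # traversal order, so a reader that knows the directory's contents
      # can recompute the same key without having seen the write.
      h = hashlib.sha256()
      h.update(parent_digest.encode())
      for d in sorted(child_file_digests):
          h.update(d.encode())
      return h.hexdigest()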

huonw added a commit that referenced this issue Nov 7, 2023
This adds a bunch of additional metrics to provide more insight into the
byte store behaviour: for each of the "read", "write" and "list missing
digests" (aka "exists") operations, track:

- number of attempts
- number of successes (for "read", this is further split into "cached"
and "uncached", i.e. whether the successful network request found data
or not)
- number of errors

This mimics the metrics for the remote action cache.

The minor shuffling of code in this PR also fixes a bug with the
`RemoteStoreBlobBytesDownloaded` metric: before this PR, any successful
network request would count as bytes downloaded, even if the digest
didn't exist/wasn't downloaded (i.e. it successfully determined that the
digest wasn't cached). This PR restricts the metric to only count digest
bytes when the network request is successful _and_ the digest actually
existed/was downloaded.
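As a sketch of the corrected accounting (Python pseudocode; the real implementation lives in the Rust engine, and these names are illustrative):

  def read_blob(digest, backend_get, record_metric):
      # backend_get and record_metric stand in for the real network and
      # metrics plumbing.
      record_metric("remote_store_read_attempts", 1)
      data = backend_get(digest)
      if data is None:
          # The request succeeded but the digest wasn't in the cache:
          # nothing was downloaded, so count no bytes.
          record_metric("remote_store_read_uncached", 1)
          return None
      record_metric("remote_store_read_cached", 1)
      # Count bytes only when the digest actually existed and was downloaded:
      record_metric("remote_store_blob_bytes_downloaded", len(data))
      return data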

I'm hopeful this will at least give more insight into the scale of the
problem in #20133, in addition to being generally useful as "what's
pants up to" reporting.

huonw commented Nov 15, 2023

I've asked support and they said:

huonw added a commit that referenced this issue Nov 15, 2023
This takes the unusual step of _removing_ documentation. In particular,
this removes the documentation for the still _very_ experimental new
remote cache backends: GitHub Actions and file system, because they're
useless and very unstable:

- there's significant user-facing refactoring work that I'm not going to
get to for 2.19 (#19902),
particularly around changing how the options are configured
- the GHA backend seems to be very... limited, and only works for small
repos (#20133)

I'm targeting 2.19, as it stabilises, and leaving the docs in `main`
(for 2.20).

huonw commented Dec 6, 2023

The #20138 metrics look like this on https://github.com/huonw/pants-cache-github-actions-poc/tree/huge. In particular, there are a lot of remote_store_exists_attempts that could potentially be skipped in favour of just doing optimistic reads/writes? I suspect that won't be enough, though.

  remote_cache_read_errors: 0
  remote_cache_request_timeouts: 0
  remote_cache_requests: 200
  remote_cache_requests_cached: 55
  remote_cache_requests_uncached: 138
  remote_cache_write_attempts: 145
  remote_cache_write_errors: 0
  remote_cache_write_successes: 15
  remote_store_exists_attempts: 7171
  remote_store_exists_errors: 0
  remote_store_exists_successes: 4444
  remote_store_missing_digest: 0
  remote_store_read_attempts: 2929
  remote_store_read_cached: 2315
  remote_store_read_errors: 0
  remote_store_read_uncached: 0
  remote_store_request_timeouts: 0
  remote_store_write_attempts: 3811
  remote_store_write_errors: 0
  remote_store_write_successes: 1222
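For scale, each of those store operations is at least one HTTP request against the GHA cache API, so summing the attempt counters above:

  exists_attempts = 7_171
  read_attempts = 2_929
  write_attempts = 3_811

  total = exists_attempts + read_attempts + write_attempts
  print(total)  # 13911 requests for a single run

  # Versus the x-ratelimit-limit values seen earlier (734 and 1623): one
  # run issues roughly 9-19x more requests than a single window allows.
  print(total / 1_623, total / 734)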

huonw added a commit that referenced this issue Mar 20, 2024
…#20690)

I think #20133 is basically a blocker for serious use of the `remote_provider = "experimental-github-actions-cache"` backend, so it's worth calling it out specifically, rather than forcing users to discover the errors and track down the issue themselves.

WorkerPants pushed a commit that referenced this issue Mar 20, 2024
…#20690) (same message as above)

huonw added a commit that referenced this issue Mar 20, 2024
… (Cherry-pick of #20690) (#20692, same message as above)

Co-authored-by: Huon Wilson <[email protected]>

danikv commented May 20, 2024

Any updates? We are getting the rate limit response quite often:
Failed to write to remote cache (1 occurrences so far): failed to write bytes to byte-store/f7/96/f796d90683a5d2eb70012e696ad1cc6f5abbc0cd45055a885f8c0869bec45c19: RateLimited (temporary) at Writer::close, context: { uri: https://acghubeus1.actions.githubusercontent.com/XOBKJTEeAYy204PqRtoOovZrR7W6T39UkmYry2K2ZH3xteFzM8/_apis/artifactcache/caches/283150, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Mon, 20 May 2024 08:30:54 GMT", "server": "Kestrel", "retry-after": "60", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "6b4507dd-bb87-48e4-9172-1cc478e2d4e8", "activityid": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-tfs-session": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-vss-e2eid": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Long", "x-ratel


huonw commented May 20, 2024

No change that I know of.
