GitHub Actions Cache backend easily hits rate limit errors #20133

Open

huonw opened this issue Nov 1, 2023 · 5 comments

huonw commented Nov 1, 2023

Describe the bug

The new experimental remote cache backend for GitHub Actions (#19831) seems to hit rate limits easily in practice, presumably because the GHA cache is designed for coarse-grained caching of larger artefacts rather than the very fine-grained caching of individual items that Pants uses it for.

For instance (full messages below):

[WARN] Failed to read from remote cache (1 occurrences so far): failed to read ...: RateLimited (persistent) at read, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

[WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to ...: RateLimited (persistent) at write, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

Example full error messages:

read:

09:55:54.25 [WARN] Failed to read from remote cache (1 occurrences so far): failed to read huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a: RateLimited (persistent) at read, context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/cache?keys=huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a&version=pants-1, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:55:53 GMT", "server": "Kestrel", "retry-after": "60", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "176cbe80-c272-43ec-99d8-58b95bda99a0", "activityid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-tfs-session": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-e2eid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "1623", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, service: ghac, path: huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a, range: 0- } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

write:

09:56:01.76 [WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b: RateLimited (persistent) at write, context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/caches, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:56:01 GMT", "server": "Kestrel", "retry-after": "286", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "1fb111d5-08ce-442f-ad9a-1c9c1c64f68f", "activityid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-tfs-session": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-e2eid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "734", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, called: Backend::ghac_reserve, service: ghac, path: huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

(NB. inline code formatting to get better line breaking)

This means the backend seems to work well for small examples, but falls over as they become larger, so mitigations are likely required.
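One generic client-side mitigation would be backing off for as long as the retry-after header asks before retrying. A minimal sketch of that idea in Python (the send callable and response shape are hypothetical stand-ins; this is not what Pants currently does):

  import time

  def with_rate_limit_backoff(send, max_attempts=5):
      # Call `send` (any zero-argument callable returning an object with
      # .status and .headers), retrying 429 responses after waiting for
      # the number of seconds the retry-after header requests.
      response = send()
      for _ in range(max_attempts - 1):
          if response.status != 429:
              return response
          time.sleep(int(response.headers.get("retry-after", "60")))
          response = send()
      return response

Given the observed retry-after values of 60-286 seconds, though, waiting them out would make larger builds extremely slow, so this alone isn't a fix.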

As a concrete reproduction, https://github.com/huonw/pants-cache-github-actions-poc/tree/cf5c53b368f0b5229f103465f25005b6db465988 (the huge branch) sets up 100 codegen targets that generate 100 files each, plus a corresponding test, so that pants test :: forces the codegen targets to do something. This means 200 process executions and 10k files need to be uploaded to or downloaded from the GHA cache in quick succession.

A build that has a (partially) seeded remote cache quickly hits errors: https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/6718179818/job/18257496697

Setting remote_store_rpc_concurrency = 1 (down from the default of 128) didn't stop this occurring in my work repo, i.e. even flat-out sequential interactions seemed to be enough to hit the rate limits too. (It did appear to help with the reduced example above.)

Picking out potentially relevant headers from those messages:

  • The x-ratelimit-limit header has values 1623 and 734; I've seen it as high as ~3000 in my work repo.
  • The x-ratelimit-remaining header, unsurprisingly, always has value 0.
  • The x-ratelimit-reset (a unix timestamp?) and retry-after (seconds to wait?) headers don't exactly align, but both say to wait on the order of minutes. For the first one, the request ran at 2023-11-01T09:55:53+00:00 and the rate limiting says (runnable versions of these conversions follow the list):
    • x-ratelimit-reset: 2023-11-01T09:58:37+00:00 (datetime.fromtimestamp(1698832717).astimezone(timezone.utc).isoformat())
    • date + retry-after: 2023-11-01T09:56:53+00:00 ((datetime(2023, 11, 1, 9, 55, 53, tzinfo=timezone.utc) + timedelta(seconds=60)).isoformat())
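Those two conversions as runnable Python, with the values from the first error's headers:

  from datetime import datetime, timedelta, timezone

  # x-ratelimit-reset appears to be a unix timestamp:
  reset = datetime.fromtimestamp(1698832717, tz=timezone.utc)
  print(reset.isoformat())  # 2023-11-01T09:58:37+00:00

  # retry-after appears to be seconds to wait, relative to the date header:
  date = datetime(2023, 11, 1, 9, 55, 53, tzinfo=timezone.utc)
  print((date + timedelta(seconds=60)).isoformat())  # 2023-11-01T09:56:53+00:00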

Info about rate limits and other similar tools' experiences:

Pants version
2.19.0.dev4

OS
GHA ubuntu runner

Additional info
N/A

huonw commented Nov 1, 2023

Brainstorming some potential ways to improve this:

  1. use the "GET API" somehow (no idea how, yet)
  2. reduce the number of requests by aggregating more blobs, particularly in the byte store:
    1. directories: upload whole Trees instead of directories recursively referencing more directories (Stu's idea in https://pantsbuild.slack.com/archives/C0D7TNJHL/p1684812802859609?thread_ts=1684809167.631159&cid=C0D7TNJHL)
    2. file blobs: batch blobs into a single upload somehow, e.g. all (small-enough) descendants of a single directory, keyed by some aggregate digest of the directory and their parent (NB. this has to be an obvious/reproducible algorithm, so that the corresponding digest can be computed when reading, with less information available than when writing); see the sketch after this list
  3. Magically increase GitHub's rate limits
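For option 2.2, a minimal sketch of what a reproducible aggregate key could look like (the function and its inputs are hypothetical, not an existing Pants API):

  import hashlib

  def batch_key(parent_digest: str, child_file_digests: list[str]) -> str:
      # Derive a deterministic key for a batched upload of a directory's
      # (small-enough) files. Sorting makes the key independent of
      # traversal order, so a reader that knows the directory's contents
      # can recompute the same key without having seen the write.
      h = hashlib.sha256()
      h.update(parent_digest.encode())
      for d in sorted(child_file_digests):
          h.update(d.encode())
      return h.hexdigest()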

huonw added a commit that referenced this issue Nov 7, 2023
This adds a bunch of additional metrics to provide more insight into the
byte store behaviour: for each of the "read", "write" and "list missing
digests" (aka "exists") operations, track:

- number of attempts
- number of successes (for "read", this is further split into "cached"
and "uncached", i.e. whether the successful network request found data
or not)
- number of errors

This mimics the metrics for the remote action cache.

The minor shuffling of code in this PR also fixes a bug with the
`RemoteStoreBlobBytesDownloaded` metric: before this PR, any successful
network request would count as bytes downloaded, even if the digest
didn't exist/wasn't downloaded (i.e. it successfully determined that the
digest wasn't cached). This PR restricts the metric to only count digest
bytes when the network request is successful _and_ the digest actually
existed/was downloaded.
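As a sketch of the corrected accounting (Python pseudocode; the real implementation lives in the Rust engine, and these names are illustrative):

  def read_blob(digest, backend_get, record_metric):
      # backend_get and record_metric stand in for the real network and
      # metrics plumbing.
      record_metric("remote_store_read_attempts", 1)
      data = backend_get(digest)
      if data is None:
          # The request succeeded but the digest wasn't in the cache:
          # nothing was downloaded, so count no bytes.
          record_metric("remote_store_read_uncached", 1)
          return None
      record_metric("remote_store_read_cached", 1)
      # Count bytes only when the digest actually existed and was downloaded:
      record_metric("remote_store_blob_bytes_downloaded", len(data))
      return data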

I'm hopeful this will at least give more insight into the scale of the
problem in #20133, in addition to being generally useful as "what's
pants up to" reporting.

huonw commented Nov 15, 2023

I've asked support and they said:

huonw added a commit that referenced this issue Nov 15, 2023
This takes the unusual step of _removing_ documentation. In particular,
this removes the documentation for the still _very_ experimental new
remote cache backends: GitHub Actions and file system, because they're
useless and very unstable:

- there's significant user-facing refactoring work that I'm not going to
get to for 2.19 (#19902),
particularly around changing how the options are configured
- the GHA backend seems to be very... limited, and only works for small
repos (#20133)

I'm targeting 2.19, as it stabilises, and leaving the docs in `main`
(for 2.20).

huonw commented Dec 6, 2023

The #20138 metrics look like this on https://github.com/huonw/pants-cache-github-actions-poc/tree/huge. In particular, there are a lot of remote_store_exists_attempts that could potentially be skipped in favour of just doing optimistic reads/writes? I suspect that won't be enough, though.

  remote_cache_read_errors: 0
  remote_cache_request_timeouts: 0
  remote_cache_requests: 200
  remote_cache_requests_cached: 55
  remote_cache_requests_uncached: 138
  remote_cache_write_attempts: 145
  remote_cache_write_errors: 0
  remote_cache_write_successes: 15
  remote_store_exists_attempts: 7171
  remote_store_exists_errors: 0
  remote_store_exists_successes: 4444
  remote_store_missing_digest: 0
  remote_store_read_attempts: 2929
  remote_store_read_cached: 2315
  remote_store_read_errors: 0
  remote_store_read_uncached: 0
  remote_store_request_timeouts: 0
  remote_store_write_attempts: 3811
  remote_store_write_errors: 0
  remote_store_write_successes: 1222
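For scale, each of those store operations is at least one HTTP request against the GHA cache API, so summing the attempt counters above:

  exists_attempts = 7_171
  read_attempts = 2_929
  write_attempts = 3_811

  total = exists_attempts + read_attempts + write_attempts
  print(total)  # 13911 requests for a single run

  # Versus the x-ratelimit-limit values seen earlier (734 and 1623): one
  # run issues roughly 9-19x more requests than a single window allows.
  print(total / 1_623, total / 734)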

huonw added a commit that referenced this issue Mar 20, 2024
…#20690)

I think #20133 is basically a blocker for serious use of the `remote_provider = "experimental-github-actions-cache"` backend, so it's worth calling it out specifically, rather than forcing users to discover the errors and track down the issue themselves.

WorkerPants pushed a commit that referenced this issue Mar 20, 2024
…#20690) (same message as above)

huonw added a commit that referenced this issue Mar 20, 2024
… (Cherry-pick of #20690) (#20692, same message as above)

Co-authored-by: Huon Wilson <[email protected]>

danikv commented May 20, 2024

Any updates? We are getting the rate limit response quite often:
Failed to write to remote cache (1 occurrences so far): failed to write bytes to byte-store/f7/96/f796d90683a5d2eb70012e696ad1cc6f5abbc0cd45055a885f8c0869bec45c19: RateLimited (temporary) at Writer::close, context: { uri: https://acghubeus1.actions.githubusercontent.com/XOBKJTEeAYy204PqRtoOovZrR7W6T39UkmYry2K2ZH3xteFzM8/_apis/artifactcache/caches/283150, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Mon, 20 May 2024 08:30:54 GMT", "server": "Kestrel", "retry-after": "60", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "6b4507dd-bb87-48e4-9172-1cc478e2d4e8", "activityid": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-tfs-session": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-vss-e2eid": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Long", "x-ratel


huonw commented May 20, 2024

No change that I know of.
