
GitHub Actions cache is surprisingly slow #1485

Open
indygreg opened this issue Dec 18, 2022 · 26 comments

@indygreg
Contributor

I was excited to see the introduction of the GitHub Actions cache backend. However, when I tried it on a test branch, the results were surprisingly slow.

See the initial run at https://github.com/indygreg/apple-platform-rs/actions/runs/3723171428/jobs/6314513850. Here, a build job took ~1 hour. (It normally takes ~10m.) The Stop sccache step displays the sccache server stats. This particular job shows 44 cache errors; I'm not sure whether those are the cause of the problem.

I did another CI run where I cranked up logging to debug and printed the log file after the build. See e.g. https://github.com/indygreg/apple-platform-rs/actions/runs/3723354206/jobs/6314814370 for a slow build. Interestingly, some of the jobs in this run were fast! See https://github.com/indygreg/apple-platform-rs/actions/runs/3723354206/jobs/6314814955 for a fast one. Spot-checking the logs, I think there is a positive correlation between the cache error count and job time.

Poking around the logs, I see occurrences of DEBUG sccache::compiler::compiler] [aws_http]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}. So I guess my project is too big to use GitHub Actions cache???

More worrisome though is that some of the cache writes take 10+s. e.g.

2022-12-18T05:13:25.3938134Z [2022-12-18T05:08:46Z DEBUG hyper::client::pool] reuse idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3938390Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] flushed 2318 bytes
2022-12-18T05:13:25.3938682Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-18T05:13:25.3939035Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-18T05:13:25.3939330Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x3 (3 bytes)
2022-12-18T05:13:25.3939686Z [2022-12-18T05:08:46Z WARN  reqwest_retry::middleware] Retry attempt #1. Sleeping 53.097849ms before the next attempt
2022-12-18T05:13:25.3940103Z [2022-12-18T05:08:46Z DEBUG hyper::client::pool] reuse idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3940353Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] flushed 2318 bytes
2022-12-18T05:13:25.3940598Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-18T05:13:25.3940878Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-18T05:13:25.3941175Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x3 (3 bytes)
2022-12-18T05:13:25.3941508Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; otel.status_code="ERROR"
2022-12-18T05:13:25.3941845Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.status_code=429
2022-12-18T05:13:25.3942174Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.user_agent=""
2022-12-18T05:13:25.3942478Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x138 (312 bytes)
2022-12-18T05:13:25.3942793Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::conn] incoming body completed
2022-12-18T05:13:25.3943218Z [2022-12-18T05:08:46Z DEBUG hyper::client::pool] pooling idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3944403Z [2022-12-18T05:08:46Z DEBUG sccache::compiler::compiler] [aws_sdk_s3]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
2022-12-18T05:13:25.3944708Z [2022-12-18T05:08:46Z DEBUG sccache::server] [aws_sdk_s3]: Cache write finished in 21.388 s
2022-12-18T05:13:25.3945012Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x138 (312 bytes)
2022-12-18T05:13:25.3945338Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x138 (312 bytes)

And

2022-12-18T05:13:25.3908847Z [2022-12-18T05:08:38Z DEBUG hyper::client::pool] pooling idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3909181Z [2022-12-18T05:08:38Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.status_code=204
2022-12-18T05:13:25.3909491Z [2022-12-18T05:08:38Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.user_agent=""
2022-12-18T05:13:25.3909910Z [2022-12-18T05:08:38Z DEBUG sccache::compiler::compiler] [aws_config]: Stored in cache successfully!
2022-12-18T05:13:25.3910204Z [2022-12-18T05:08:38Z DEBUG sccache::server] [aws_config]: Cache write finished in 13.018 s
2022-12-18T05:13:25.3910535Z [2022-12-18T05:08:45Z DEBUG sccache::compiler::compiler] [aws_sdk_s3]: Compiled in 21.388 s, storing in cache
2022-12-18T05:13:25.3910931Z [2022-12-18T05:08:46Z INFO  gha_toolkit::cache] put; version="5c35548282b30b97a2314eadc6fc42b9b7585a12c4f9efa527ac2e48739babe0"
2022-12-18T05:13:25.3911660Z [2022-12-18T05:08:46Z INFO  gha_toolkit::cache] reserve; key="sccache-ubuntu-22.04-x86_64-unknown-linux-gnu-1.62.1-0300a5abb7bd36eb7e4aa0b971d5e94866c6e140" version="7c4ed3dc414c5c2d030ec00d8394e733c5dfeff8cdbae21becc2144de0bb9570" cache_size=21844888
2022-12-18T05:13:25.3912483Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.method=POST http.scheme=https http.host=artifactcache.actions.githubusercontent.com net.host.port=0 otel.kind="client" otel.name=POST /LNWNXsg1z0Qt0Mn41fuRdbA1YzcaL5Z2Wf2AYmEQiLGLuQCklb/_apis/artifactcache/caches

I'm not sure how to interpret these logs. But multi-second values for Cache write finished in seem wrong: I would expect a cache write to effectively be an HTTP request and for inserts to take dozens to hundreds of milliseconds. But maybe the GitHub Actions cache latency is just really high?

I'm unsure what actions are needed for this issue, or whether there are any code-level bugs. Maybe the GitHub Actions cache is just prohibitively slow? Maybe this slowness/behavior should be documented?

@sylvestre
Collaborator

Hi @indygreg, nice to see you here :)

@jakelee8 does it ring a bell?

@sylvestre
Collaborator

More worrisome though is that some of the cache writes take 10+s. e.g.

I might be wrong, but I think it also includes the compilation time.
Side note: maybe we should display the compilation time and the cache time separately in the logs.

@jakelee8
Contributor

The 429 Too Many Requests error is a legitimate issue. Do y'all think it would help to implement a cache write queue to limit parallelism?
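
For illustration, here is a minimal sketch of what such a write queue could look like, assuming a tokio runtime; write_to_ghac and queued_write are hypothetical names and the limit of 4 is arbitrary, so this is not sccache's actual storage code:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical cache-write helper; the real upload path in sccache differs.
async fn write_to_ghac(_key: String, _payload: Vec<u8>) -> Result<(), String> {
    // ... perform the HTTP requests against the Actions cache service ...
    Ok(())
}

// Bound the number of concurrent cache writes so a large build does not fire
// hundreds of requests at once and trip the service's rate limit.
async fn queued_write(
    limiter: Arc<Semaphore>,
    key: String,
    payload: Vec<u8>,
) -> Result<(), String> {
    // Writes beyond the limit wait here for a permit.
    let _permit = limiter.acquire().await.map_err(|e| e.to_string())?;
    write_to_ghac(key, payload).await
}

#[tokio::main]
async fn main() {
    // Allow at most 4 in-flight writes (an arbitrary, illustrative limit).
    let limiter = Arc::new(Semaphore::new(4));
    let mut tasks = Vec::new();
    for i in 0..32 {
        let limiter = limiter.clone();
        tasks.push(tokio::spawn(queued_write(limiter, format!("key-{i}"), vec![0u8; 1024])));
    }
    for task in tasks {
        let _ = task.await;
    }
}

A fixed permit count would cap burstiness, though it would not by itself respect the Retry-After values the service returns.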

From what I've seen, it looks like compile time is included in the write time. Cache reads are fast at 0.000 s. Here's an example of one of our internal builds:

sccache 0.3.1
Compile requests                    722
Compile requests executed           608
Cache hits                          539
Cache hits (C/C++)                  100
Cache hits (Rust)                   439
Cache misses                         65
Cache misses (C/C++)                 15
Cache misses (Rust)                  50
Cache timeouts                        0
Cache read errors                     0
Forced recaches                       0
Cache write errors                    0
Compilation failures                  4
Cache errors                          0
Non-cacheable compilations            0
Non-cacheable calls                 114
Non-compilation calls                 0
Unsupported compiler calls            0
Average cache write               1.379 s
Average cache read miss           1.379 s
Average cache read hit            0.000 s
Failed distributed compilations       0

Non-cacheable reasons:
crate-type                           83
unknown source language              24
-                                     4
-E                                    2
argument parse                        1

@jakelee8
Contributor

Also, did you set the environment variable CARGO_INCREMENTAL=false?

From the README:

Incrementally compiled crates cannot be cached. By default, in the debug profile Cargo will use incremental compilation for workspace members and path dependencies. You can disable incremental compilation.

https://github.com/mozilla/sccache#rust

@glandium
Collaborator

More worrisome though is that some of the cache writes take 10+s. e.g.

FWIW, these logs are wrong. PR #1495 will hopefully fix that.

@indygreg
Contributor Author

I do not set CARGO_INCREMENTAL=false in these CI jobs. Maybe that explains some regressions in caching in recent months?!

(Also, I'm not receiving emails for updates in this repository for some reason. I suspect they may be getting routed to my defunct mozilla.com email. But I don't see any references to that email in my GitHub account. So unsure what's going on. Apologies if I'm tardy to reply - I probably didn't see the activity!)

@rajivshah3

rajivshah3 commented Dec 30, 2022

I'm seeing similar issues with HTTP error 429, and can confirm that we're setting CARGO_INCREMENTAL=0:

2022-12-22:13: DEBUG reqwest::connect] starting new connection: https://artifactcache.actions.githubusercontent.com/
2022-12-22:13: DEBUG hyper::client::connect::dns] resolving host="artifactcache.actions.githubusercontent.com"
2022-12-22:13: DEBUG hyper::client::connect::http] connecting to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::client::connect::http] connected to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::proto::h1::io] flushed 2214 bytes
2022-12-22:13: DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-22:13: DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 3 (3 bytes)
2022-12-22:13: WARN  reqwest_retry::middleware] Retry attempt #1. Sleeping 265.s before the next attempt
2022-12-22:13: DEBUG reqwest::connect] starting new connection: https://artifactcache.actions.githubusercontent.com/
2022-12-22:13: DEBUG hyper::client::connect::dns] resolving host="artifactcache.actions.githubusercontent.com"
2022-12-22:13: DEBUG hyper::client::connect::http] connecting to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::client::connect::http] connected to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::proto::h1::io] flushed 2214 bytes
2022-12-22:13: DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-22:13: DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 3 (3 bytes)
2022-12-22:13: INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; otel.status_code="ERROR"
2022-12-22:13: INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.status_code=429
2022-12-22:13: INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.user_agent=""
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 138 (312 bytes)
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 138 (312 bytes)
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 138 (312 bytes)
2022-12-22:13: DEBUG hyper::proto::h1::conn] incoming body completed
2022-12-22:13: DEBUG hyper::client::pool] pooling idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-22:13: DEBUG sccache::compiler::compiler] [unicode_ident]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
2022-12-22:13: DEBUG sccache::server] [unicode_ident]: Cache write finished in 2.982 s

I'm surprised that it still reports that the cache write is finished though 🤔

@sylvestre
Collaborator

@rajivshah3 in the future, please strip the color escape sequences; they make the log harder to read ;)

For cache writes, we don't need to be super fast. Maybe a queue would do.
@Xuanwo do you think you could do that once you've added the opendal support for GHA? (or as part of the change?)

@Xuanwo
Collaborator

Xuanwo commented Jan 2, 2023

@Xuanwo do you think you could do that once you've added the opendal support for GHA? (or as part of the change?)

Sure! I will add rate limit handling after implementing apache/opendal#1111

@Xuanwo
Collaborator

Xuanwo commented Jan 3, 2023

apache/opendal#1111 has passed all of opendal's tests now! It will land in sccache soon.

@sylvestre
Collaborator

@Xuanwo you are auto-approving your patches in opendal?

@Xuanwo
Collaborator

Xuanwo commented Jan 3, 2023

@Xuanwo you are auto-approving your patches in opendal?

Yep, most PRs are merged by me after all checks have passed.

@sylvestre
Collaborator

ok :(
this is a bit surprising ...

@Xuanwo
Collaborator

Xuanwo commented Jan 3, 2023

ok :(
this is a bit surprising ...

I expect to improve this part once we have more maintainers~

@sylvestre
Collaborator

yes, please do; this should not be done on an important project

@Xuanwo
Collaborator

Xuanwo commented Jan 9, 2023

Hi @indygreg, I reproduced your problem with the v0.4.0-pre.5 release at https://github.com/Xuanwo/apple-platform-rs/actions/runs/3871176193/jobs/6598713731

The real problem here is that ghac returns a very long Retry-After, which can be up to 60 s. So if we follow this limit strictly, we will observe the surprising slowness.

The full error message is here:

sccache: error: Server startup failed: cache storage failed to write: Unexpected (permanent) at write => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

Context:
    response: Parts { status: 429, version: HTTP/1.1, headers: {"transfer-encoding": "chunked", "content-type": "application/json", "retry-after": "60", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "daad0d22-be5d-4106-9f45-205eb7eedac3", "activityid": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-tfs-session": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-vss-e2eid": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "758", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1673244953", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27.", "x-cache": "CONFIG_NOCACHE", "x-msedge-ref": "Ref A: 489C21A91BE741E29D754583C16E0C7C Ref B: PAOEDGE0516 Ref C: 2023-01-09T06:11:06Z", "date": "Mon, 09 Jan 2023 06:11:06 GMT"} }
    called: Backend::ghac_reserve
    service: ghac
    path: .sccache_check

Returning a cache miss or giving up on this cache write may be a better choice.
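
To make that trade-off concrete, here is a rough sketch using reqwest; best_effort_put and WriteOutcome are invented names, and the single PUT stands in for the real reserve/upload/commit flow. On a 429 the write is simply dropped instead of sleeping for the full Retry-After:

use reqwest::StatusCode;

// Outcome of a best-effort cache write.
enum WriteOutcome {
    Stored,
    Skipped, // rate limited or otherwise unavailable; treated like a miss
}

// Illustrative helper: attempt the write once and give up on 429 instead of
// honoring the (potentially 60 s) Retry-After the service returns.
async fn best_effort_put(
    client: &reqwest::Client,
    url: &str,
    body: Vec<u8>,
) -> Result<WriteOutcome, reqwest::Error> {
    let resp = client.put(url).body(body).send().await?;
    match resp.status() {
        StatusCode::TOO_MANY_REQUESTS => Ok(WriteOutcome::Skipped),
        status if status.is_success() => Ok(WriteOutcome::Stored),
        // Any other failure is also non-fatal for a cache: skip and move on.
        _ => Ok(WriteOutcome::Skipped),
    }
}

The build then proceeds as if the compile result were simply uncached; the only cost is a colder cache for the next run.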

@Xuanwo
Collaborator

Xuanwo commented Jan 12, 2023

Hi, @indygreg. Here are our updates on addressing this issue.

I ran a test with my two patches (#1557 and #1550) on apple-platform-rs.

The biggest lesson I have learned from this is that sccache is a cache service, and we don't need to ensure every write/read succeeds. So instead of waiting to retry rate-limited requests, we skip them and keep going.

Here are the differences:

Before

[screenshot of CI job times before the patches]

After

No failed builds anymore.

[screenshot of CI job times after the patches]

The action is here: Xuanwo/apple-platform-rs#1

@indygreg
Contributor Author

(I'm still not receiving email notifications from Mozilla orgs due to a GitHub bug involving an orphaned email forwarding rule left over from when I worked at Mozilla.)

The updates here look great and I hope to test out the GHA backend again soon. The new times from the previous comment appear no worse than what I'm getting with 0.3.3 on S3 today. Considering the GHA backend lets me enable sccache on PRs without leaking my cloud provider credentials, this looks like a very compelling feature now! Thanks for all your work!

@Xuanwo
Collaborator

Xuanwo commented Jan 16, 2023

@indygreg Please give v0.4.0-pre.6 a try~

@rajivshah3

Hi @Xuanwo, is there anything that can be done to improve this? On v0.4.0-pre.6 we're still seeing a lot of 429 errors in the logs (on both reads and writes) and there isn't much of a difference in compilation time (33 min uncached vs 29 min cached). The logs from sccache are uploaded as build artifacts if that would help. I wonder if it would be worth having an opt-in feature to try the failed writes again in the background once the compilation is done?
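
As a rough illustration of that idea (not an existing sccache feature; DeferredWrites and the put callback are invented for illustration), rate-limited writes could be parked in a queue and flushed only after the build finishes:

use std::collections::VecDeque;
use std::sync::Mutex;
use std::time::Duration;

// Writes that were rejected with 429, to be retried after the build finishes
// (for example while the sccache server is being stopped).
struct DeferredWrites {
    pending: Mutex<VecDeque<(String, Vec<u8>)>>,
}

impl DeferredWrites {
    fn new() -> Self {
        Self { pending: Mutex::new(VecDeque::new()) }
    }

    // Record a write that was rate limited during the build.
    fn defer(&self, key: String, payload: Vec<u8>) {
        self.pending.lock().unwrap().push_back((key, payload));
    }

    // Retry everything once compilation is done, pausing between attempts so
    // the rate-limit window has a chance to reset.
    async fn flush<F, Fut>(&self, mut put: F)
    where
        F: FnMut(String, Vec<u8>) -> Fut,
        Fut: std::future::Future<Output = Result<(), String>>,
    {
        // Take the whole queue up front so no lock is held across an await.
        let queued: Vec<(String, Vec<u8>)> =
            self.pending.lock().unwrap().drain(..).collect();
        for (key, payload) in queued {
            if put(key, payload).await.is_err() {
                // Still being throttled: give up rather than stall shutdown.
                break;
            }
            tokio::time::sleep(Duration::from_millis(500)).await;
        }
    }
}

The obvious downside is that draining the queue at the end of the job could add a long tail to the workflow if the limit is still exhausted.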

@Xuanwo
Collaborator

Xuanwo commented Jan 27, 2023

On v0.4.0-pre.6 we're still seeing a lot of 429 errors in the logs (on both reads and writes)

This is GitHub's limit, and I don't know how to overcome it so far.

there isn't much of a difference in compilation time (33 min uncached vs 29 min cached).

Sadly, sccache can't cache the heaviest dependency in your project, rocksdb. Based on my own experience, a full release build of rocksdb always takes 20 to 30 min. Maybe you can try running tests in a debug build instead of a release build.

I wonder if it would be worth having an opt-in feature to try the failed writes again in the background once the compilation is done?

That could take a very long time, though(

@onbjerg

onbjerg commented Feb 14, 2023

Hi! Did the caching behavior change a lot from 0.3.3 to 0.4.0-pre.7? Locally, running

cargo llvm-cov nextest --lcov --output-path lcov.info --locked --workspace --all-features --partition hash:1/3

yields these stats:

Compile requests                   1359
Compile requests executed          1033
Cache hits                          791
Cache hits (C/C++)                   46
Cache hits (Rust)                   745
Cache misses                        238
Cache misses (C/C++)                  2
Cache misses (Rust)                 236
Cache timeouts                        0
Cache read errors                     0
Forced recaches                       0
Cache write errors                    0
Compilation failures                  4
Cache errors                          0
Non-cacheable compilations            0
Non-cacheable calls                 314
Non-compilation calls                12
Unsupported compiler calls            0
Average cache write               0.857 s
Average cache read miss           0.857 s
Average cache read hit            0.000 s
Failed distributed compilations       0

Non-cacheable reasons:
crate-type                          280
unknown source language              22
-                                     8
argument parse                        3
missing input                         1

Cache location                  Local disk: "/home/oliver/.cache/sccache"
Cache size                           10 GiB
Max cache size                       20 GiB

But on CI (see paradigmxyz/reth#1355) these numbers are vastly different:

Compile requests                    733
Compile requests executed           537
Cache hits                          300
Cache hits (C/C++)                    4
Cache hits (Rust)                   296
Cache misses                        235
Cache misses (C/C++)                 20
Cache misses (Rust)                 215
Cache timeouts                        0
Cache read errors                     0
Forced recaches                       0
Cache write errors                    0
Compilation failures                  2
Cache errors                          0
Non-cacheable compilations            0
Non-cacheable calls                 189
Non-compilation calls                 7
Unsupported compiler calls            0
Average cache write               0.078 s
Average compiler                  0.947 s
Average cache read hit            0.000 s
Failed distributed compilations       0

Non-cacheable reasons:
crate-type                          168
unknown source language              11
-                                     6
argument parse                        3
missing input                         1

Cache location                  ghac, name: sccache-v0.4.0-pre.7-sccache, prefix: /sccache/
Version (client)                0.4.0-pre.7

The cache seems to be populated, writes and reads are reported as pretty fast, but the build is a lot slower than using rust-cache - some of our jobs went from 5 mins build time to 12-13 mins (see main CI).

I based the workflow changes off of Xuanwo/apple-platform-rs#1

@Xuanwo
Collaborator

Xuanwo commented Feb 15, 2023

Hi! Did the caching behavior change a lot from 0.3.3 to 0.4.0-pre.7? Locally, running

I have tried this command with sccache v0.3.3 but got:

Stopping sccache server...
Compile requests                    733
Compile requests executed           537
Cache hits                            0
Cache misses                        535
Cache misses (C/C++)                 24
Cache misses (Rust)                 511
Cache timeouts                        0
Cache read errors                     0
Forced recaches                       0
Cache write errors                    0
Compilation failures                  2
Cache errors                          0
Non-cacheable compilations            0
Non-cacheable calls                 189
Non-compilation calls                 7
Unsupported compiler calls            0
Average cache write               0.776 s
Average cache read miss           0.776 s
Average cache read hit            0.000 s
Failed distributed compilations       0

Non-cacheable reasons:
crate-type                          168
unknown source language              11
-                                     6
argument parse                        3
missing input                         1

Cache location                  Local disk: "/home/xuanwo/.cache/sccache"
Cache size                          680 MiB
Max cache size                       10 GiB

The compile request count is the same. Are you running the same command? By the way, maybe it's better to build the tests with sccache, upload them as artifacts, and then run the different hash partitions from those.

@onbjerg

onbjerg commented Feb 16, 2023

@Xuanwo Unfortunately using artifacts is a lot slower, since the artifact upload speed is abysmal :( The artifact upload takes as long as (or longer than) the build does.

I am running the same command locally. I'll try a fresh clone; maybe something is messed up with my local copy of the repository I am building.

Do you know/have any pointers on what could make sccache slower than rust-cache for the linked PR?

@Xuanwo
Collaborator

Xuanwo commented Feb 16, 2023

Do you know/have any pointers on what could make sccache slower than rust-cache for the linked PR?

No ideas so far. Maybe we are hitting the GitHub Actions cache's rate limit.

@onbjerg

onbjerg commented Feb 17, 2023

I thought so too, but sccache --show-stats does not show any read or write errors. Maybe they are not reported for GHAC?
