airbyte-ci: better gradle caching #31535
Conversation
This works locally but I haven't yet assessed its impact on source-postgres CI performance. I'm cautiously optimistic.
edit: Second run: https://github.com/airbytehq/airbyte/actions/runs/6559827735
I need a couple of clarifications and would appreciate seeing a performance boost before approving.
I believe you did not notice a performance boost because we have to enable remote caching of your new persistent volume here for the cross-runner remote cache.
My understanding so far is that:
- This change will allow us to cache Java dependencies across pipelines, avoiding the costly downloads.
- The gradle build cache is only used for the duration of a connector pipeline. Connectors sharing the same tasks won't benefit from incremental build caching.
I'd like to understand the reasoning behind point 2 a bit more. Do you plan on enabling the S3 remote cache to get a shared build cache?
"""This cache volume is for sharing gradle state across all pipeline runs.""" | ||
return self.context.dagger_client.cache_volume("gradle-persistent-cache") |
The actual cross-pipeline-run persistence of this cache can be enabled by adding gradle-persistent-cache to this list.
Perhaps it would be simpler to rename this volume to gradle-cache then. WDYT?
Can we rename to gradle-dependency-cache?
OK I'll open a PR in airbyte-infra.
    @property
    def connector_transient_cache_volume(self) -> CacheVolume:
        """This cache volume is for sharing gradle state across tasks within a single connector pipeline run."""
        return self.context.dagger_client.cache_volume(f"gradle-{self.context.connector.technical_name}-transient-cache")
Can we add a pipeline identifier (like the git_revision) to the cache volume name?
If dagger ever enables automatic cross-pipeline-run persistence of the cache volume, it will guarantee this volume is only used for a single commit.
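Roughly something like this — a sketch only, assuming the current commit is exposed on the context as git_revision (the exact attribute name may differ):

```python
# Hypothetical variant of the transient cache volume property, scoping the volume
# name by commit so it can never be shared across pipeline runs, even if dagger
# ever persists cache volumes automatically.
@property
def connector_transient_cache_volume(self) -> CacheVolume:
    """Gradle state shared across tasks within a single connector pipeline run, for one commit only."""
    return self.context.dagger_client.cache_volume(
        f"gradle-{self.context.connector.technical_name}-{self.context.git_revision}-transient-cache"
    )
```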
I have no objections. We can discuss this offline but I was under the impression that mounting this volume as PRIVATE would be enough. On second thought, perhaps not.
    def connector_transient_cache_volume(self) -> CacheVolume:
        """This cache volume is for sharing gradle state across tasks within a single connector pipeline run."""
        return self.context.dagger_client.cache_volume(f"gradle-{self.context.connector.technical_name}-transient-cache")
I'm concerned that using a "per pipeline" build cache will disable remote incremental build caching that can benefit multiple connectors on different pipelines.
E.g. with our current implementation, I thought that if a pipeline builds the java base it will seed the cache, and subsequent pipelines (on other connectors) will hit the cache and won't rebuild the java base.
An alternative approach to support remote caching of incremental builds would be to use the gradle S3 remote cache.
I don't think this concern is applicable now that we have the CDK. The "java base" you mention is the CDK now and when building a connector we just pull the jars. Regarding more aggressive caching see my top-level response.
        with_whole_git_repo = (
            gradle_container_base
            # Mount the whole repo.
            .with_mounted_directory("/airbyte", self.context.get_repo_dir("."))
The last time I tried mounting the whole repo without include, it was a very long operation when running locally, due to the number/size of files to mount into the dagger build context.
Can we only mount the "gradle projects"?
In my experience it wasn't noticeably long, like a couple of seconds. I didn't look into it too deeply. In any case, yes, let's only mount directories which either have a build.gradle file or recursively contain a subdirectory which contains a build.gradle file. Is there a shorthand for this?
No shorthand, probably a glob like here
I don't think a glob is going to cut it. Are you OK with me doing this change in a followup PR?
Yep!
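For that follow-up, a rough sketch of what building the include list could look like, assuming a plain filesystem walk on the host (gradle_project_dirs is a hypothetical helper; get_repo_dir and its include parameter are the ones already used in this PR):

```python
from pathlib import Path


def gradle_project_dirs(repo_root: str = ".") -> list[str]:
    """Directories (relative to the repo root) that contain a build.gradle file.

    Mounting only these keeps the dagger build context small compared to
    mounting the whole repo. build.gradle.kts files would need the same treatment.
    """
    root = Path(repo_root)
    return sorted(
        str(build_file.parent.relative_to(root))
        for build_file in root.rglob("build.gradle")
    )


# Usage, e.g.: self.context.get_repo_dir(".", include=gradle_project_dirs())
```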
            # Mount the whole repo.
            .with_mounted_directory("/airbyte", self.context.get_repo_dir("."))
            # Mount the cache volume for the gradle cache which is persisted throughout all pipeline runs.
            # We deliberately don't mount any cache volumes before mounting the git repo otherwise these will effectively be always empty.
Why would it be empty, as it's not sharing a path with /airbyte?
I honestly don't know why. Perhaps it's a dagger bug? In any case, if you mount the cache volume before the repo, that container layer gets aggressively cached by the dagger engine (as it should) and never gets rebuilt for subsequent airbyte-ci runs (which makes sense); but, somehow, this means that the cache volume is always empty on my laptop. Mounting it after a layer whose cache gets busted by a source file change makes it perform as expected.
Presumably, the same thing happens in our runners. It's weird and unexpected for sure.
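In other words, the ordering that behaves correctly looks roughly like this — a sketch, using the volume name quoted above and an illustrative mount path:

```python
container = (
    gradle_container_base
    # Mounted first: this is the layer that gets invalidated by source changes.
    .with_mounted_directory("/airbyte", self.context.get_repo_dir("."))
    # Mounted after the repo, so the volume is re-attached on every airbyte-ci run
    # instead of being frozen into an earlier, aggressively cached layer.
    .with_mounted_cache("/root/.gradle", self.context.dagger_client.cache_volume("gradle-persistent-cache"))
)
```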
            .with_mounted_directory("/airbyte", self.context.get_repo_dir(".", include=include))
            # Mount the sources for the connector and its dependencies.
            .with_mounted_directory(str(self.context.connector.code_directory), await self.context.get_connector_dir())
In order to avoid doubling the mount operation from host FS to container, I think we can get the directory from with_whole_git_repo:
.with_mounted_directory("/airbyte", with_whole_git_repo.directory("/airbyte", include=include))
Good idea! I hadn't thought about the performance impact of this.
See comment below for why I couldn't do this in the end.
            .with_mounted_directory("/airbyte", self.context.get_repo_dir(".", include=include))
            # Mount the sources for the connector and its dependencies.
            .with_mounted_directory(str(self.context.connector.code_directory), await self.context.get_connector_dir())
Don't we already have the connector code at this stage, as we already mounted the full repo in the last with_mounted_directory operation?
No, we deliberately don't base this container off the container with the whole repo mounted.
The reason for this is that mounting the whole repo is only necessary to, more or less, download all the jars and compile the CDK. These outputs change far less often than the git repo itself.
Consider the following scenario: you create a PR which makes changes to two unrelated connectors. airbyte-ci will run and the following will happen:
- All the layers in with_whole_git_repo need to be rebuilt; that's normal, the git repo is different than on master.
- All the layers in gradle_container also need to be rebuilt, as expected.

Now, let's say you push a commit which only changes one of the two connectors and airbyte-ci runs on the same runner:
- All the layers in with_whole_git_repo need to be rebuilt; that's normal, the git repo is different again.
- The resulting /root/.gradle and /root/.m2 directories are rebuilt, but they end up being the same as in the previous airbyte-ci run, because there were no changes to the dependencies or to the CDK sources.
- For the connector which was not changed, the layers in gradle_container can be re-used.

What this means is that the layers for the connector container (with the partial git repo mount) are more likely to be reused.
To be clear: this is mostly speculation on my part and this deserves a test. I'll try this out on my machine to confirm!
As it turns out, this speculation was wrong. I rewrote this a half-dozen times and still couldn't get /root/.gradle to be reused, even when paring it down to /root/.gradle/caches/modules-2 as explained in https://docs.gradle.org/current/userguide/dependency_resolution.html#sub:cache_copy and below.
I was able to get this to work in the end, but it required mounting connector sources from the host instead of from the other container!
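Concretely, the shape that ended up working is roughly the following — a sketch, not the exact code in this PR, which has more mounts and environment setup:

```python
# (inside an async method on the gradle step, hence `self` and `await`)

# Heavy, repo-wide container: only used to resolve dependencies and publish the CDK snapshot.
with_whole_git_repo = (
    gradle_container_base
    .with_mounted_directory("/airbyte", self.context.get_repo_dir(".", include=include))
    .with_exec(self._get_gradle_command(":airbyte-cdk:java:airbyte-cdk:publishSnapshotIfNeeded"))
)

# Connector container: a sibling built from the same base, not from with_whole_git_repo.
# Mounting the connector sources from the host keeps its layers independent of
# unrelated changes elsewhere in the repo.
gradle_container = (
    gradle_container_base
    .with_mounted_directory(str(self.context.connector.code_directory), await self.context.get_connector_dir())
)
```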
            # This will download gradle itself, a bunch of poms and jars, compile the gradle plugins, configure tasks, etc.
            .with_exec(self._get_gradle_command(":airbyte-cdk:java:airbyte-cdk:publishSnapshotIfNeeded"))
            # Mount the cache volume for the transient gradle cache used for this connector only.
            # This volume is PRIVATE meaning it exists only for the duration of the dagger pipeline.
Dagger has a pipeline terminology that differs from what we call a pipeline. It's not clear to me whether the PRIVATE sharing mode means that the cache volume will exist for the duration of a dagger session or for the duration of a sub-pipeline. We currently have Step sub-pipelines. If a private cache volume only exists for the duration of a sub-pipeline, it will defeat your purpose of sharing the cache across steps.
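For reference, the sharing mode is passed when the volume is mounted; a minimal sketch with the dagger Python SDK (mount path and volume name are illustrative). Per the SDK docs, PRIVATE scopes the volume to a single "build pipeline", which is exactly the term whose boundaries are unclear here:

```python
import dagger


def mount_transient_gradle_cache(client: dagger.Client, container: dagger.Container) -> dagger.Container:
    # Other modes are dagger.CacheSharingMode.SHARED (the default) and LOCKED.
    return container.with_mounted_cache(
        "/root/.gradle",
        client.cache_volume("gradle-transient-cache"),
        sharing=dagger.CacheSharingMode.PRIVATE,
    )
```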
Interesting. I did find the caching to work, but now I'm starting to think that it only worked accidentally.
What do you think about writing an integration test that would help us actually test the caching behavior? Something using a dummy GradleTask subclass and running a gradle command on a dummy gradle project. It'd be helpful to validate our caching assumptions and prevent regressions...
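Something along these lines, maybe — a self-contained sketch that doesn't rely on any existing airbyte-ci fixture and only assumes a local dagger engine; the image tag, paths, and the FROM-CACHE assertion are illustrative:

```python
import sys

import anyio
import dagger

BUILD_GRADLE = 'plugins { id "java" }\n'
JAVA_SRC = "public class Dummy {}\n"


async def test_gradle_cache_volume_is_reused() -> None:
    async with dagger.Connection(dagger.Config(log_output=sys.stderr)) as client:
        # A dummy gradle project built entirely in-memory.
        project = (
            client.directory()
            .with_new_file("build.gradle", contents=BUILD_GRADLE)
            .with_new_file("src/main/java/Dummy.java", contents=JAVA_SRC)
        )
        cache = client.cache_volume("dummy-gradle-cache")

        def run(marker: str) -> dagger.Container:
            return (
                client.container()
                .from_("gradle:8.2-jdk17")
                .with_user("root")
                .with_env_variable("GRADLE_USER_HOME", "/root/.gradle")
                .with_mounted_cache("/root/.gradle", cache)
                .with_mounted_directory("/project", project)
                .with_workdir("/project")
                # Busts dagger's own layer cache so gradle really runs a second time.
                .with_env_variable("RUN_MARKER", marker)
                .with_exec(["gradle", "--build-cache", "--console=plain", "compileJava"])
            )

        await run("first").sync()
        second_output = await run("second").stdout()
        # If the mounted volume survived between the two executions, the second
        # compileJava should be served from gradle's local build cache.
        assert "FROM-CACHE" in second_output


if __name__ == "__main__":
    anyio.run(test_gradle_cache_volume_is_reused)
```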
That sounds like a good idea. Do we already have similar kinds of integration tests in airbyte-ci?
Kind of. In this test I wrote some assertions about cache / cache busting.
Thanks!
Thanks for this review!
Your understanding matches mine. I think that for CI purposes a clean build is always favourable. In other words, if we can build from source, and the source is in the repo, then we should. I think if we can get this scheme to work then we can do without the S3 cache. Configuring the gradle build and downloading the jars are what's expensive, and we should be comfortable caching these steps.
The magicache cache sync isn't enabled yet, so there's no point in running this more than once, but here's a run on source-postgres: https://github.com/airbytehq/airbyte/actions/runs/6565091506
@postamar I re-ran the test pipeline. Expectation: this will seed the persistent dependency cache. We should trigger a second run a couple of minutes after it succeeds and check whether we get a performance boost: we expect the deps to be cached in the remote volume. First run duration: 23m 5s
@alafanechere I actually opened up a PR based on this one to test these assumptions: #31580. I looked at your runs also. Something's still off. The dependency cache is handled correctly from what I can tell, but the transient cache isn't. I think I know why.
I'm fed up with my own two-dagger-cache scheme. For a start, it doesn't work. Furthermore, I'm worried it's unmaintainable. It's certainly untestable. So I'm getting rid of it.
…dle-caching' into better-gradle-caching
Step | Result |
---|---|
Build source-paypal-transaction docker image for platform(s) linux/x86_64 | ✅ |
Acceptance tests | ✅ |
Code format checks | ✅ |
Validate metadata for source-paypal-transaction | ✅ |
Connector version semver check | ✅ |
QA checks | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=source-paypal-transaction test
There is a setting in the BuildKit engine related to this. I believe it's called keepBytes. I currently have it set to 50%; we could bump that up to 90% or so.
Indeed.
Running https://github.com/airbytehq/airbyte/actions/runs/6592254879 to validate. This should be good to go once https://github.com/airbytehq/airbyte-infra/pull/63 is approved, merged and deployed in prod.
My main concern is about setting the env var on the container in the mounted_connector_secrets function.
Let's see where this goes.
Co-authored-by: postamar <[email protected]>
Informs #31439
This PR defines two cache volumes for java connector pipelines:
- a persistent volume, for sharing gradle dependency state across all pipeline runs;
- a transient, per-connector volume, for sharing gradle state across tasks within a single connector pipeline run.
Care is taken not to bust the dagger container layer cache, because updating the global cache requires mounting the whole git repo.