airbyte-ci: better gradle caching #31535

postamar · 2023-10-18T02:56:22Z

This PR defines two cache volumes for java connector pipelines:

one which is persisted globally and which is used to cache downloaded jars and poms, and not more, because we don't trust gradle too much and would prefer clean builds from source;
another which is private to the pipeline and whose purpose is to avoid duplication of effort during distTar, test and integrationTest.

Care is taken to not bust the dagger container layer cache because updating the global cache requires mounting the whole git repo.

vercel · 2023-10-18T02:56:29Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment

Name	Status	Preview	Comments	Updated (UTC)
airbyte-docs	⬜️ Ignored (Inspect)	Visit Preview		Oct 23, 2023 2:57pm

postamar · 2023-10-18T03:19:09Z

This works locally but I haven't yet assessed its impact on source-postgres CI performance. I'm cautiously optimistic.
First run: https://github.com/airbytehq/airbyte/actions/runs/6555788875

╭─────────────────────── SOURCE-POSTGRES - TEST RESULTS ───────────────────────╮
│                                Steps results                                 │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Step                                             ┃ Result     ┃ Duration ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ Build connector tar                              │ Successful │ 04mn31s  │ │
│ │ Java Connector Unit Tests                        │ Successful │ 16mn16s  │ │
│ │ Build source-*** docker image for           │ Successful │ 16.04s   │ │
│ │ platform(s) linux/x86_64                         │            │          │ │
│ │ Acceptance tests                                 │ Successful │ 02mn52s  │ │
│ │ Java Connector Integration Tests                 │ Successful │ 06mn32s  │ │
│ │ Validate metadata for source-***            │ Successful │ 31.09s   │ │
│ │ Connector version semver check                   │ Successful │ 0.30s    │ │
│ │ Connector version increment check                │ Skipped    │ 0.00s    │ │
│ │ QA checks                                        │ Successful │ 39.29s   │ │
│ └──────────────────────────────────────────────────┴────────────┴──────────┘ │
│ ℹ️  You can find more details with step executions logs in the saved HTML     │
│ report.                                                                      │
╰────────── ⏲️  Total pipeline duration for source-***: 21mn12s ───────────╯

Second run: https://github.com/airbytehq/airbyte/actions/runs/6559827735

╭─────────────────────── SOURCE-POSTGRES - TEST RESULTS ───────────────────────╮
│                                Steps results                                 │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Step                                             ┃ Result     ┃ Duration ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ Build connector tar                              │ Successful │ 03mn56s  │ │
│ │ Java Connector Unit Tests                        │ Successful │ 15mn59s  │ │
│ │ Build source-*** docker image for           │ Successful │ 10.02s   │ │
│ │ platform(s) linux/x86_64                         │            │          │ │
│ │ Acceptance tests                                 │ Successful │ 02mn52s  │ │
│ │ Java Connector Integration Tests                 │ Successful │ 06mn21s  │ │
│ │ Validate metadata for source-***            │ Successful │ 4.04s    │ │
│ │ Connector version semver check                   │ Successful │ 0.32s    │ │
│ │ Connector version increment check                │ Skipped    │ 0.00s    │ │
│ │ QA checks                                        │ Successful │ 3.28s    │ │
│ └──────────────────────────────────────────────────┴────────────┴──────────┘ │
│ ℹ️  You can find more details with step executions logs in the saved HTML     │
│ report.                                                                      │
╰────────── ⏲️  Total pipeline duration for source-***: 20mn03s ───────────╯

edit: ~~unsurprisingly the second run didn't run on the same runner as the first so it started with a cold cache again. The penalty for this is fairly low, on the order of 2m30.~~ (edit edit: I'm less sure now) A couple of days after merging this change we'll want to check that the caches are usually warm. This is the case when running airbyte-ci locally so I'm not worried about the fact that the cache works. Note that prior to this change, it didn't (because we mounted the cache volume before mounting the source files)

alafanechere

I need a couple of clarification and would appreciate perceiving a performance boost before approving.
I believe you did not notice a perfomance boost because we have to enable remote caching of your new persistent volume here for cross runner remote cache.

My understanding so far is that:

This change will allow us to cache java dependency across pipeline, avoiding the costly downloads.
The gradle build cache is only used for the duration of a connector pipeline. Connectors sharing the same tasks won't benefit from increment build caching.

I'd like to understand the reasoning behind 2 a bit more. Do you plan on enabling the S3 remote cache to get a shared build cache?

alafanechere · 2023-10-18T10:59:38Z

airbyte-ci/connectors/pipelines/pipelines/gradle.py

+        """This cache volume is for sharing gradle state across all pipeline runs."""
+        return self.context.dagger_client.cache_volume("gradle-persistent-cache")


The actual cross-pipeline-run persistency of this cache can be enabled by adding gradle-persistent-cache to this list

Perhaps it would be simpler to rename this volume to gradle-cache then. WDYT?

Can we rename to gradle-dependency-cache ?

OK I'll open a PR in airbyte-infra.

https://github.com/airbytehq/airbyte-infra/pull/62

alafanechere · 2023-10-18T11:01:39Z

airbyte-ci/connectors/pipelines/pipelines/gradle.py

+    @property
+    def connector_transient_cache_volume(self) -> CacheVolume:
+        """This cache volume is for sharing gradle state across tasks within a single connector pipeline run."""
+        return self.context.dagger_client.cache_volume(f"gradle-{self.context.connector.technical_name}-transient-cache")


Can we add a pipeline identifier (like the git_revision) to the cache volume name?
If dagger ever enable an automatic cross pipeline run persistence of the cache volume it will guarantee this volume will be only used for a single commit.

I have no objections. We can discuss this offline but I was under the impression that mounting this volume as PRIVATE would be enough. On second thought, perhaps not.

alafanechere · 2023-10-18T11:06:58Z

airbyte-ci/connectors/pipelines/pipelines/gradle.py

+    def connector_transient_cache_volume(self) -> CacheVolume:
+        """This cache volume is for sharing gradle state across tasks within a single connector pipeline run."""
+        return self.context.dagger_client.cache_volume(f"gradle-{self.context.connector.technical_name}-transient-cache")


I'm concerned that using a "per pipeline" build cache will disable remote incremental build caching that can benefit multiple connectors on different pipeline.
E.G: with our current implementation I thought that if a pipeline builds the java base it will seed the cache and next pipelines (on other connectors) will hit the cache and won't rebuild the java base.

An alternative approach to support remote caching of incremental builds would be to use the gradle S3 remote cache.

I don't think this concern is applicable now that we have the CDK. The "java base" you mention is the CDK now and when building a connector we just pull the jars. Regarding more aggressive caching see my top-level response.

alafanechere · 2023-10-18T11:10:59Z

airbyte-ci/connectors/pipelines/pipelines/gradle.py

+        with_whole_git_repo = (
+            gradle_container_base
+            # Mount the whole repo.
+            .with_mounted_directory("/airbyte", self.context.get_repo_dir("."))


The last time I tried mounting the whole repo without include it was a very long operation when running locally, due to the number/size of files to mount to the dagger build context.
Can we only mount the "gradle projects"?

In my experience it wasn't noticeably long, like a couple of seconds. I didn't look into it too deeply. In any case, yes, let's only mount directories which either have a build.gradle file or recursively contain a subdirectory which contains a build.gradle file. Is there a shorthand for this?

No shorthand, probably a glob like here

I don't think a glob is going to cut it. Are you OK with me doing this change in a followup PR?

alafanechere · 2023-10-18T11:12:42Z

airbyte-ci/connectors/pipelines/pipelines/gradle.py

+            # Mount the whole repo.
+            .with_mounted_directory("/airbyte", self.context.get_repo_dir("."))
+            # Mount the cache volume for the gradle cache which is persisted throughout all pipeline runs.
+            # We deliberately don't mount any cache volumes before mounting the git repo otherwise these will effectively be always empty.


Why would it be empty as it's not sharing path with /airbyte?

I honestly don't know why. Perhaps it's a dagger bug? In any case, if you mount the cache volume before the repo, that container layer gets aggressively cached by the dagger engine (as it should) and never gets rebuilt for subsequent airbyte-ci runs (which makes sense); but, somehow, this means that the cache volume is always empty on my laptop. Mounting it after the cache gets busted by a source file change makes it perform as expected.

Presumably, the same thing happens in our runners. It's weird and unexpected for sure.

alafanechere · 2023-10-18T11:17:16Z

airbyte-ci/connectors/pipelines/pipelines/gradle.py

            .with_mounted_directory("/airbyte", self.context.get_repo_dir(".", include=include))
+            # Mount the sources for the connector and its dependencies.
            .with_mounted_directory(str(self.context.connector.code_directory), await self.context.get_connector_dir())


In order to avoid doubling the mount operation from host FS to container I think we can get the directory from the with_whole_git_repo:
.with_mounted_directory("/airbyte", with_whole_git_repo.directory("/airbyte", include=include)

Good idea! I hadn't thought about the performance impact of this.

See comment below for why I couldn't do this in the end.

alafanechere · 2023-10-18T11:18:08Z

airbyte-ci/connectors/pipelines/pipelines/gradle.py

            .with_mounted_directory("/airbyte", self.context.get_repo_dir(".", include=include))
+            # Mount the sources for the connector and its dependencies.
            .with_mounted_directory(str(self.context.connector.code_directory), await self.context.get_connector_dir())


Don't we already have the connector code at this stage as we already mounted the full repo in the last with_mounted_directory operation?

No, we deliberately don't base this container off the container with the whole repo mounted.

The reason for this is that mounting the whole repo is only necessary to, more or less, download all the jars and compile the CDK. These outputs change far less often than the git repo itself.

Consider the following scenario: you create a PR up which makes changes to two unrelated connectors. We'll have airbyte-ci run and the following will happen:

All the layers in with_whole_git_repo need to be rebuilt; that's normal, the git repo is different than on master.

All the layers in gradle_container also need to be rebuilt, as expected.

Now, let's say you push a commit which only changes one of the two connectors and airbyte-ci runs on the same runner:

All the layers in with_whole_git_repo need to be rebuilt; that's normal, the git repo is different again.

The resulting /root/.gradle and /root/.m2 directors are rebuilt but they end up being the same as in the previous airbyte-ci run, because there were no changes to the dependencies and to the CDK sources.

For the connector which was not changed, the layers in gradle_container can be re-used.

What this means is that the layers for the connector container (with the partial git repo mount) are more likely to be reused.

To be clear: this is mostly speculation on my part and this deserves a test. I'll try this out on my machine to confirm!

As it turns out, this speculation was wrong. I rewrote this a half-dozen times and still couldn't get /root/.gradle to be reused, even when paring it down to /root/.gradle/caches/modules-2 as explained in https://docs.gradle.org/current/userguide/dependency_resolution.html#sub:cache_copy and below.

I was able to get this to work in the end, but it required mounting connector sources from the host instead of from the other container!

alafanechere · 2023-10-18T11:23:13Z

airbyte-ci/connectors/pipelines/pipelines/gradle.py

-            # This will download gradle itself, a bunch of poms and jars, compile the gradle plugins, configure tasks, etc.
-            .with_exec(self._get_gradle_command(":airbyte-cdk:java:airbyte-cdk:publishSnapshotIfNeeded"))
+            # Mount the cache volume for the transient gradle cache used for this connector only.
+            # This volume is PRIVATE meaning it exists only for the duration of the dagger pipeline.


Dagger has a pipeline terminology that differ from what we call a pipeline.

It's not clear to me if the PRIVATE sharing means that the cache volume will exist for the duration of a dagger session or for the duration of a sub-pipeline.
We currently have Step subpipelines.
If private cache volume exist for the duration of a sub-pipeline it will defeat your purpose of sharing the cache across steps.

Interesting. I did find the caching to work, but now I'm starting to thing that it only worked accidentally.

What do you think about writing an integration test that would help us actually test the caching behavior? Something using a dummy GradleTask subclass and running gradle command on a dummy gradle project. It'd be helpful to validate our caching assumptions and prevent regressions...

That sounds like a good idea. Do we already have similar kinds of integrations tests in airbyte-ci?

Kind of. I this test I wrote some assertions about cache / cache bursting.

postamar · 2023-10-18T13:19:45Z

Thanks for this review!

My understanding so far is that: ...

Your understanding matches mine. I think that for CI purposes a clean build is always favourable. In other words, if we can build from source, and the source is in the repo, then we should.

I think if we can get this scheme to work then we can do without the S3 cache. Configuring the gradle build and downloading the jars are what's expensive and we should be comfortable to cache these steps.

postamar · 2023-10-18T18:33:58Z

The magicache cache sync isn't enabled yet so there's no point in running this more than once but here's a run on source-postgres: https://github.com/airbytehq/airbyte/actions/runs/6565091506

alafanechere · 2023-10-19T10:53:05Z

@postamar I re-ran the test pipeline:
https://github.com/airbytehq/airbyte/actions/runs/6573520735

Expectation: this will seed the persistent dependency cache.

We should trigger a second run a couple of minutes after it succeeds and check if we get performance boost: we expect the deps to be cached in the remote volume.

First run duration: 23m 5s
Second run: 22mn

postamar · 2023-10-19T13:18:35Z

@alafanechere I actually opened up a PR based on this one to test these assumptions: #31580

I looked at your runs also. Something's still off. The dependency cache is handled correctly from what I can tell, but the transient cache isn't. I think I know why.

postamar · 2023-10-19T15:11:08Z

I'm fed up with my own two-dagger-cache scheme. For a start it doesn't work. Furthermore I'm worried it's unmaintainable. It's certainly untestable. So I'm getting rid of it.

…dle-caching' into better-gradle-caching

airbyte-oss-build-runner · 2023-10-20T03:49:27Z

source-paypal-transaction test report (commit `d057b84600`) - ✅

⏲️ Total pipeline duration: 59.00s

Step	Result
Build source-paypal-transaction docker image for platform(s) linux/x86_64	✅
Acceptance tests	✅
Code format checks	✅
Validate metadata for source-paypal-transaction	✅
Connector version semver check	✅
QA checks	✅

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-paypal-transaction test

airbyte-oss-build-runner · 2023-10-20T04:21:20Z

source-paypal-transaction test report (commit `3bf9fcfe8a`) - ✅

⏲️ Total pipeline duration: 01mn20s

Step	Result
Build source-paypal-transaction docker image for platform(s) linux/x86_64	✅
Acceptance tests	✅
Code format checks	✅
Validate metadata for source-paypal-transaction	✅
Connector version semver check	✅
QA checks	✅

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-paypal-transaction test

c-p-b · 2023-10-20T12:54:19Z

When I looked at the second run (in the comment above) I found that it had run for longer because the deps cache was cold.

I've come to the bottom of why the dagger cache volume so often appears to be empty. It's because it is! As you know the dagger engine relies on buildkit. This periodically performs garbage collection, which gets rid of orphaned container layers and so forth, including cache volumes presumably. This volume contains many gigabytes of data and the gradle layers are in all likelihood equally fat. It's worth noting that the cache gets emptied less often on my laptop than on the runners; perhaps that's because the engine is less active on my machine?

The default buildkit policies are documented here: https://docs.docker.com/build/cache/garbage-collection/ I know this is a docker page but it's the same for buildkit. I don't know if the dagger engine uses the same default buildkit policy but we should probably edit whatever policy dagger uses to suit our needs: we want to hang on to the cache volumes as long as possible bearing in mind that we build lots of fat layers very often every day.

cc @cpdeethree @alafanechere

edit: here's another run to test that the pipeline works: https://github.com/airbytehq/airbyte/actions/runs/6583132357

there is a setting in the build kit engine related to this. I believe it’s called keepBytes. I currently have it set to 50%, we could bump that up to 90% or so

postamar · 2023-10-20T13:04:23Z

Indeed keepBytes is one of them, defined in [[worker.oci.gcpolicy]] sections. It's the same structure as the doc I linked and I believe there's more to it that that one setting. I'll open up a PR in airbyte-infra and we can discuss there maybe.

postamar · 2023-10-20T20:22:39Z

Running https://github.com/airbytehq/airbyte/actions/runs/6592254879 to validate.

This should be good to go once https://github.com/airbytehq/airbyte-infra/pull/63 is approved, merged and deployed in prod.

settings.gradle

alafanechere

My main concern is about setting the env var on the container in the mounted_connector_secrets function

.github/workflows/connectors_tests.yml

airbyte-ci/connectors/pipelines/pipelines/dagger/actions/secrets.py

airbyte-ci/connectors/pipelines/pipelines/airbyte_ci/steps/gradle.py

postamar · 2023-10-23T15:26:17Z

Let's see where this goes.

Co-authored-by: postamar <[email protected]>

airbyte-ci: better gradle caching

39e0255

Marius Posta added 2 commits October 17, 2023 22:58

bump version and update changelog

94a67f7

format

ad894e8

postamar marked this pull request as ready for review October 18, 2023 03:15

postamar requested a review from alafanechere October 18, 2023 03:15

alafanechere reviewed Oct 18, 2023

View reviewed changes

Marius Posta added 5 commits October 18, 2023 09:45

apply feedback from first round of review

f375b3f

Merge branch 'master' into better-gradle-caching

19b4b4f

fix caching problems

539c25e

add comment

e312f93

Merge branch 'master' into better-gradle-caching

ec66b05

vercel bot deployed to Preview October 18, 2023 18:33 View deployment

Automated Commit - Formatting Changes

269ba97

postamar mentioned this pull request Oct 18, 2023

[DNM] tests for PR #31535 #31580

Closed

Marius Posta added 9 commits October 19, 2023 12:29

scrap transient cache, introduce s3 build cache

89a0d51

Merge branch 'master' into better-gradle-caching

b94a738

Merge branch 'master' into better-gradle-caching

32998a9

format

b1ac061

fix typo

722b652

make S3 caching contingent on credentials being available

d23c2b3

rsync really needs to be in the same step as gradle

6e48486

set s3 cache in us-west-2

cdb4f5d

add --no-watch-fs flag

4f805e9

vercel bot deployed to Preview October 20, 2023 03:32 View deployment

Automated Commit - Formatting Changes

42b4621

octavia-squidington-iii added area/connectors Connector related issues connectors/source/paypal-transaction labels Oct 20, 2023

Marius Posta added 2 commits October 19, 2023 23:44

fixes

4a7b57e

Merge remote-tracking branch 'refs/remotes/origin/postamar/better-gra…

d057b84

…dle-caching' into better-gradle-caching

octavia-squidington-iv requested a review from a team October 20, 2023 03:45

Automated Commit - Formatting Changes

3bf9fcf

bazarnov approved these changes Oct 20, 2023

View reviewed changes

Marius Posta added 2 commits October 20, 2023 14:04

mount the gradle cache volume as early as possible in the pipeline

1c9c7ca

Merge branch 'master' into better-gradle-caching

0e363e6

postamar requested a review from a team October 20, 2023 18:06

octavia-squidington-iii removed the area/connectors Connector related issues label Oct 20, 2023

vercel bot deployed to Preview October 20, 2023 18:08 View deployment

alafanechere reviewed Oct 23, 2023

View reviewed changes

settings.gradle Show resolved Hide resolved

alafanechere suggested changes Oct 23, 2023

View reviewed changes

Marius Posta added 2 commits October 23, 2023 10:18

apply review comments

d538c55

Merge branch 'master' into better-gradle-caching

a87adb0

vercel bot deployed to Preview October 23, 2023 14:28 View deployment

tweaks

3fa7a52

alafanechere approved these changes Oct 23, 2023

View reviewed changes

postamar merged commit 9835f6b into master Oct 23, 2023

postamar deleted the postamar/better-gradle-caching branch October 23, 2023 15:26

ariesgun pushed a commit to ariesgun/airbyte that referenced this pull request Oct 23, 2023

airbyte-ci: better gradle caching (airbytehq#31535)

365cfbd

Co-authored-by: postamar <[email protected]>

		"""This cache volume is for sharing gradle state across all pipeline runs."""
		return self.context.dagger_client.cache_volume("gradle-persistent-cache")

airbyte-ci: better gradle caching #31535

airbyte-ci: better gradle caching #31535

Conversation

postamar commented Oct 18, 2023 • edited Loading

vercel bot commented Oct 18, 2023 • edited Loading

postamar commented Oct 18, 2023 • edited Loading

alafanechere left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alafanechere Oct 18, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

postamar commented Oct 18, 2023

postamar commented Oct 18, 2023

alafanechere commented Oct 19, 2023 • edited Loading

postamar commented Oct 19, 2023

postamar commented Oct 19, 2023

airbyte-oss-build-runner commented Oct 20, 2023

source-paypal-transaction test report (commit d057b84600) - ✅

airbyte-oss-build-runner commented Oct 20, 2023

source-paypal-transaction test report (commit 3bf9fcfe8a) - ✅

c-p-b commented Oct 20, 2023

postamar commented Oct 20, 2023

postamar commented Oct 20, 2023

alafanechere left a comment

Choose a reason for hiding this comment

postamar commented Oct 23, 2023

postamar commented Oct 18, 2023 •

edited

Loading

vercel bot commented Oct 18, 2023 •

edited

Loading

postamar commented Oct 18, 2023 •

edited

Loading

alafanechere left a comment •

edited

Loading

alafanechere Oct 18, 2023 •

edited

Loading

alafanechere commented Oct 19, 2023 •

edited

Loading

source-paypal-transaction test report (commit `d057b84600`) - ✅

source-paypal-transaction test report (commit `3bf9fcfe8a`) - ✅