
move S3Config into destination-s3; update dependencies accordingly #8562

Merged
11 commits merged into master on Dec 10, 2021

Conversation

edgao
Contributor

@edgao edgao commented Dec 6, 2021

What

It doesn't make much sense for destination-jdbc to hold S3Config, or for destination-s3 to depend on destination-jdbc. This is a prereq for #8550.

Also delete the unnecessary JDBC acceptance test - it doesn't work anyway, because destination-jdbc doesn't declare a main class.

Also update the S3 docs to reflect a past spec.json update.

How

Replace references to S3Config with S3DestinationConfig, then update module dependencies. Most importantly, remove s3's dependency on jdbc.

Note that a few class members got moved around to accommodate this:

  • S3StreamCopier#DEFAULT_PART_SIZE now lives inside S3DestinationConfig; this was the only place it was referenced anyway.
  • S3StreamCopier's attemptS3WriteAndDelete and getAmazonS3 methods now live on S3Destination. This necessitated updates to the JDBC, Redshift, and Snowflake destinations, which now list destination-s3 as a dependency.
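The first bullet can be sketched as a small, dependency-free illustration of the shape of the move (the class name follows the PR, but the fields, the default value, and the fallback logic here are hypothetical, not the real Airbyte code):

```java
// Illustrative sketch only - mirrors the structure of the refactor, not the real code.
class S3DestinationConfig {

    // Formerly S3StreamCopier#DEFAULT_PART_SIZE; the config class was the only
    // place that referenced it, so it now lives here. Value is hypothetical.
    static final long DEFAULT_PART_SIZE_MB = 10;

    private final String bucketName;
    private final Long partSizeMb; // nullable: absent when not set in the connector config

    S3DestinationConfig(final String bucketName, final Long partSizeMb) {
        this.bucketName = bucketName;
        this.partSizeMb = partSizeMb;
    }

    long getPartSizeMb() {
        // Fall back to the default when the user did not configure a part size.
        return partSizeMb != null ? partSizeMb : DEFAULT_PART_SIZE_MB;
    }

    String getBucketName() {
        return bucketName;
    }
}
```

Keeping the constant next to the only code that reads it is the point of the move: destination-s3 no longer has to reach through destination-jdbc to find its own default.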

There are also a few formatting changes that IntelliJ insists on making. IntelliJ is set up according to this, so I'm not quite sure what's happening.

Recommended reading order

I'd actually recommend reviewing the two commits separately; they're only in one PR to keep this change atomic.

The first commit moves S3Config into destination-s3, and makes any module that depended on destination-jdbc now depend on destination-s3.

  1. S3Config.java
  2. S3Destination.java
  3. S3StreamCopier.java
  4. Everything else - this boils down to FQCN/gradle dependency updates.

The second commit moves all S3Config references to use S3DestinationConfig instead.

  1. S3DestinationConfig.java - all of the changes are just copying in code from the old S3Config
  2. Everything else. Again, sorry for all the messy format changes :(

🚨 User Impact 🚨

If anyone had code that depended on S3Config / S3StreamCopier (i.e. outside of this repo) then that will break. Otherwise, this should have no visible impact.

Pre-merge Checklist

Expand the relevant checklist and delete the others.

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the new connector version is published, connector version bumped in the seed directory as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

@CLAassistant

CLAassistant commented Dec 6, 2021

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the area/connectors Connector related issues label Dec 6, 2021
@edgao edgao force-pushed the jdbc_s3_dependency_reverse branch from 2e4e078 to 34dd5f0 Compare December 6, 2021 23:27
@edgao edgao marked this pull request as ready for review December 7, 2021 00:05
@edgao edgao requested a review from sherifnada December 7, 2021 00:05
Contributor

@sherifnada sherifnada left a comment


LGTM modulo unit tests

s3.deleteObject(s3Bucket, outputTableName);
}

public static AmazonS3 getAmazonS3(final S3DestinationConfig s3Config) {
Contributor

would a unit test make sense here?

Contributor Author

added a test, but it's pretty mediocre. AmazonS3 objects don't actually expose any methods to extract their configurations, so all the test accomplishes is to check that this method doesn't throw an exception, and that it returns something non-null :/

Contributor Author

actually, I can just kill this method and use S3DestinationConfig#getS3Client instead
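That last comment can be sketched without the AWS SDK. The names below are hypothetical stand-ins (the real method is S3DestinationConfig#getS3Client, returning an AmazonS3), and the lazy caching shown is an assumption about one reasonable design, not necessarily the real implementation: the config builds its own client rather than exposing a static helper on the stream copier.

```java
import java.util.function.Supplier;

// Illustrative only: "Object" stands in for the real AmazonS3 client type.
class LazyClientConfig {
    private final Supplier<Object> clientFactory;
    private Object client; // built on first use, then reused

    LazyClientConfig(final Supplier<Object> clientFactory) {
        this.clientFactory = clientFactory;
    }

    synchronized Object getS3Client() {
        if (client == null) {
            client = clientFactory.get();
        }
        return client;
    }
}
```

The design benefit is the one the thread converges on: callers ask the config for a client instead of passing the config to a static utility, so there is exactly one place that knows how to turn config into a client.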

@edgao
Contributor Author

edgao commented Dec 7, 2021

hm. I'm pretty sure DefaultAirbyteSourceTest and DefaultAirbyteDestinationTest shouldn't be breaking due to these changes; is there a way to rerun just the one specific check?

@edgao
Contributor Author

edgao commented Dec 7, 2021

hmmm. between the docs and (I think?) the spec, is spec.json canonical or are we building towards what docs.airbyte describes? Specifically, the docs say that part size is a top-level option for destination-s3, but the spec is putting it inside the lower-level avro/csv/jsonl format objects, with parquet not exposing it at all

(the partSize field on S3DestinationConfig is actually redundant with S3FormatConfig#getPartSize, but not sure which one would be better to remove)

Contributor

@sherifnada sherifnada left a comment


LGTM

@sherifnada
Contributor

@edgao is that option relevant for parquet? I'm not seeing a default value being set in the code. @tuliren do you have any context on why page size is not relevant for parquet s3?

@tuliren
Contributor

tuliren commented Dec 7, 2021

@tuliren do you have any context on why page size is not relevant for parquet s3?

Page size is relevant for Parquet. Part size is not.

Here is how we construct the S3 Parquet writer, it requires the page size for compression:

https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-s3/src/main/java/io/airbyte/integrations/destination/s3/parquet/S3ParquetWriter.java#L63

Here is how we construct the routine S3 writer (for JSONL and CSV), it requires the part size for uploading:

https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-s3/src/main/java/io/airbyte/integrations/destination/s3/util/S3StreamTransferManagerHelper.java#L22

@tuliren
Contributor

tuliren commented Dec 7, 2021

hmmm. between the docs and (I think?) the spec, is spec.json canonical or are we building towards what docs.airbyte describes? Specifically, the docs say that part size is a top-level option for destination-s3, but the spec is putting it inside the lower-level avro/csv/jsonl format objects, with parquet not exposing it at all

(the partSize field on S3DestinationConfig is actually redundant with S3FormatConfig#getPartSize, but not sure which one would be better to remove)

@edgao, the doc is outdated. The part size is only needed for Avro, CSV, and JSONL. It is not required for Parquet, at least not for now, because the Parquet writer is constructed differently and does not require this parameter.

So the current spec.json is correct that part_size_mb only exists in the low-level format object.

If I remember correctly, when the S3 destination was first rolled out, it only supported CSV. That's why the part size was at the top level initially. But when we added Parquet as the second format, we realized that part size should not be a top level parameter, and moved it inside format. However, the doc did not get updated accordingly.
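The rule tuliren describes can be summed up in a small hypothetical sketch (the enum and class names are illustrative, not the actual Airbyte types): part size belongs to the formats that go through the multipart stream upload, and page size belongs to Parquet only.

```java
// Illustrative sketch of which parameter each output format needs.
enum S3Format { AVRO, CSV, JSONL, PARQUET }

final class S3FormatParams {
    // Avro/CSV/JSONL are uploaded via the stream transfer manager, which
    // needs a part size for multipart upload.
    static boolean needsPartSize(final S3Format format) {
        return format != S3Format.PARQUET;
    }

    // Parquet goes through its own writer, which takes a page size for
    // compression instead of a part size.
    static boolean needsPageSize(final S3Format format) {
        return format == S3Format.PARQUET;
    }
}
```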

Contributor Author

@edgao edgao left a comment


sounds good, will update the docs to match the spec.

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Dec 8, 2021
@edgao
Contributor Author

edgao commented Dec 8, 2021

latest commits just do some final code cleanup (replacing calls to S3StreamCopier.attemptWriteAndDelete with the S3Destination method) and update the S3 docs to match the spec. The doc update isn't super detailed; it just removes the part_size param and links to spec.json instead of referencing a nonexistent paragraph.

Do I need to run through all the "Airbyter" stuff in the PR template? I.e. /test connector=connectors/<name>, release to dockerhub, etc? Or can I just merge this PR, and the docs update will propagate automatically?

@edgao
Contributor Author

edgao commented Dec 10, 2021

/test connector=destination-snowflake

🕑 destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1561440688

@edgao
Contributor Author

edgao commented Dec 10, 2021

/test connector=destination-snowflake

🕑 destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1565195945
✅ destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1565195945
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    12      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     120      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 11      0   100%
	 normalization/transform_catalog/stream_processor.py                 468    287    39%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         140     29    79%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1160    472    59%

@edgao
Contributor Author

edgao commented Dec 10, 2021

/test connector=destination-redshift

🕑 destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1565196085
✅ destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1565196085
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    12      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     120      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 11      0   100%
	 normalization/transform_catalog/stream_processor.py                 468    287    39%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         140     29    79%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1160    472    59%

@edgao
Contributor Author

edgao commented Dec 10, 2021

/test connector=destination-gcs

🕑 destination-gcs https://github.com/airbytehq/airbyte/actions/runs/1565196940
✅ destination-gcs https://github.com/airbytehq/airbyte/actions/runs/1565196940
No Python unittests run

@edgao edgao merged commit fc91f67 into master Dec 10, 2021
@edgao edgao deleted the jdbc_s3_dependency_reverse branch December 10, 2021 23:51
@edgao edgao mentioned this pull request Dec 11, 2021
jrhizor added a commit that referenced this pull request Dec 11, 2021
* upgrade gradle

* upgrade to Java 17 (and fix a few of the node versioning misses)

* oops

* try to run a different format version

* fix spotless by upgrading / reformatting some files

* fix ci settings

* upgrade mockito to avoid other errors

* undo bad format

* fix "incorrect" sql comments

* fmt

* add debug flag

* remove

* bump

* bump jooq to a version that has a java 17 dist

* fix

* remove logs

* oops

* revert jooq upgrade

* fix

* set up java for connector test

* fix yaml

* generate std source tests

* fail zombie job attempts and add failure reason (#8709)

* fail zombie job attempts and add failure reason

* remove failure reason

* bump gcp dependencies to pick up grpc update (#8713)

* Bump Airbyte version from 0.33.9-alpha to 0.33.10-alpha (#8714)

Co-authored-by: jrhizor <[email protected]>

* Change CDK "Caching" header to "nested streams & caching"

* Update fields in source-connectors specifications: file, freshdesk, github, google-directory, google-workspace-admin-reports, iterable (#8524)

Signed-off-by: Sergey Chvalyuk <[email protected]>

Co-authored-by: Serhii Chvaliuk <[email protected]>
Co-authored-by: Sherif A. Nada <[email protected]>

* move S3Config into destination-s3; update dependencies accordingly (#8562)

Co-authored-by: Lake Mossman <[email protected]>
Co-authored-by: jrhizor <[email protected]>
Co-authored-by: Sherif A. Nada <[email protected]>
Co-authored-by: Iryna Grankova <[email protected]>
Co-authored-by: Serhii Chvaliuk <[email protected]>
Co-authored-by: Edward Gao <[email protected]>
edgao added a commit that referenced this pull request Dec 11, 2021
@@ -11,6 +11,9 @@ WORKDIR /airbyte

ENV APPLICATION destination-snowflake

# Needed for JDK17 (in turn, needed on M1 macs) - see https://github.com/snowflakedb/snowflake-jdbc/issues/589#issuecomment-983944767
ENV DESTINATION_SNOWFLAKE_OPTS "--add-opens java.base/java.nio=ALL-UNNAMED"
Contributor

@VitaliiMaltsev VitaliiMaltsev Dec 14, 2021


@edgao looks like the issue described here with JDK17 still persists on the master branch

Contributor Author

yeah, I'm not sure what happened - snowflake was definitely passing for me locally, but after the java 17 upgrade it's failing again. Haven't been able to find a fix, either.


@edgao @VitaliiMaltsev Were you guys able to find the fix?


@edgao is there any side effect to using the JVM flag jvmArgs = ["--add-opens=java.base/java.nio=ALL-UNNAMED"]?
Any potential security issues?

Contributor Author

my understanding is that this isn't a security check, it's more so that the Java devs can modify some Java internal implementation in the future. I.e. they want to discourage reliance on java's internal APIs, so access is now disabled by default.


@edgao thanks for the response. However, based on JEP 261:

The --add-exports and --add-opens options must be used with great care. You can use them to gain access to an internal API of a library module, or even of the JDK itself, but you do so at your own risk: If that internal API is changed or removed then your library or application will fail.

I'm not that familiar with how gaining more internal access to java.nio could be concerning. Therefore, asking if you have insight into it.

Contributor Author

The risk occurs when upgrading Java versions - there's no guarantee that the underlying implementation will remain the same. So in theory, the Java devs could release a new version of Java, which modifies those internal APIs, and therefore break the snowflake JDBC driver. For Airbyte, this risk is mitigated by the source/destination acceptance tests, which actually build the Docker image and check that it can interact with a real Snowflake instance, i.e. this would be caught before release. But we would be unable to upgrade to that new Java version until an updated snowflake-jdbc is released.

I'm not actually sure which issue is being used to track progress on removing snowflake-jdbc's dependency on those internal APIs, but would recommend asking in https://github.com/snowflakedb/snowflake-jdbc if you're interested in the details / how they're thinking about the problem. Some potentially related issues: snowflakedb/snowflake-jdbc#484, snowflakedb/snowflake-jdbc#589, snowflakedb/snowflake-jdbc#533
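For anyone hitting this locally, the flag from this thread can be supplied in a couple of places. Both forms below are illustrative sketches: the environment variable name matches this PR's Dockerfile snippet, while the jar name is hypothetical.

```shell
# Pass the flag directly on the JVM command line (jar name illustrative):
#   java --add-opens java.base/java.nio=ALL-UNNAMED -jar destination-snowflake.jar

# Or set it via an environment variable consumed by the launch script, as this
# PR's Dockerfile does for destination-snowflake:
export DESTINATION_SNOWFLAKE_OPTS="--add-opens java.base/java.nio=ALL-UNNAMED"
echo "$DESTINATION_SNOWFLAKE_OPTS"
```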

schlattk pushed a commit to schlattk/airbyte that referenced this pull request Jan 4, 2022
schlattk pushed a commit to schlattk/airbyte that referenced this pull request Jan 4, 2022
8 participants