Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 Destination Redshift (copy): accept bucket path for staging data #8607

Merged
merged 52 commits into from
Dec 17, 2021

Conversation

edgao
Copy link
Contributor

@edgao edgao commented Dec 8, 2021

What

Create a new S3StreamCopier that utilizes the same code as destination-s3, rather than rolling its own S3 upload logic. Addresses #8550

Also, incidentally allows users to specify the bucket path to store staging data in.

How

  • Rename the existing S3StreamCopier+Factory to LegacyS3StreamCopier+Factory and @Deprecate them
  • Write unit tests to lock in their behavior
  • Create new S3StreamCopier+Factory, built on top of destination-s3 code, and copy the Legacy unit tests

Misc. notes:

  • The S3 paths are now different:
    • Legacy behavior was to create s3://bucket/randomUuid/schema/stream-12345, where 12345 is the number of the file (i.e. starts at 00000 and increments for each file). Note that there was no .csv extension at all.
    • New behavior creates s3://bucket/bucketPath/namespace/stream/timestamp_randomUuid.csv.
      • Redshift now accepts a bucketPath option.
      • The timestamp is set to when the RedshiftStreamCopier is instantiated. I'm pretty sure this is basically the time at which the AirbyteMessageConsumer is created, i.e. pretty close to the start of the upload. But technically NOT the same.
  • S3Writer#getObjectKey is unimplemented in most classes, because I didn't want to dig into all of their behaviors.
  • The new mockito:mockito-inline dependencies are needed for the mocked constructors.

Recommended reading order

  1. S3CsvWriter.java + test + StagingDatabaseCsvSheetGenerator.java + BaseS3Writer
  2. S3StreamCopier.java + test + factory
  3. LegacyS3StreamCopierTest.java - mostly there to demonstrate identical behavior with S3StreamCopierTest.java
  4. RedshiftStreamCopier.java + test + factory
  5. spec.json (adding parsing for this new option happened in https://github.com/airbytehq/airbyte/pull/8562/files#diff-234d0545ef4c8a9abd756f519d85f20cefa71acacdb93a9a8f4ce2a6f88482f3R74-R77 )
  6. S3DestinationConfig, S3CsvFormatConfig
  7. everything else: build.gradle (new mockito-inline dep), the various XyzWriter.java (autoformatting + stub getObjectKey implementation)

This commit (1b4f41f#diff-5a76fbf0219c39138b2a18b074b17dde604966094df637542b142b08cd4358b9) could be interesting; it's the diff between the legacy and new behavior.

🚨 User Impact 🚨

None. The temporary files will be named differently, but they will still be created+deleted the same way.

They will be in a different directory. Hopefully nobody was allowlisting a service account to only have permission to write into s3://bucket/anyValidUuid/*.

Pre-merge Checklist

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the new connector version is published, connector version bumped in the seed directory as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

@github-actions github-actions bot added the area/connectors Connector related issues label Dec 8, 2021
@edgao edgao force-pushed the edgao/s3_based_stream_copier branch from 7a9e31c to 1ce7b90 Compare December 8, 2021 21:15
@edgao edgao temporarily deployed to more-secrets December 8, 2021 21:17 Inactive
@edgao edgao temporarily deployed to more-secrets December 9, 2021 22:22 Inactive
Base automatically changed from jdbc_s3_dependency_reverse to master December 10, 2021 23:51
@edgao edgao force-pushed the edgao/s3_based_stream_copier branch from 322ce7c to 22e2bd7 Compare December 14, 2021 22:22
@edgao edgao temporarily deployed to more-secrets December 14, 2021 22:23 Inactive
@edgao edgao temporarily deployed to more-secrets December 14, 2021 22:31 Inactive
@edgao edgao changed the title [WIP] create new JDBC S3 copier based on destination-s3 code 🎉 Destination Redshift (copy): accept bucket path for staging data Dec 14, 2021
@edgao edgao temporarily deployed to more-secrets December 14, 2021 23:44 Inactive
@edgao edgao temporarily deployed to more-secrets December 15, 2021 00:27 Inactive
@edgao
Copy link
Contributor Author

edgao commented Dec 16, 2021

@sherifnada do we ever do minor version bumps on connectors? I want to have destination-s3 go from 0.1.16 to 0.2.0 to reflect the change in filename convention; is that reasonable or is this still considered just a patch-level change?

i.e. this changelog entry fed6c36#diff-69de65b90934b0e7e8f813cfa8298d11e6cc1104e9f64e3a010eee0441b81db9R226

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Dec 16, 2021
@edgao edgao temporarily deployed to more-secrets December 16, 2021 01:11 Inactive
Copy link
Contributor

@sherifnada sherifnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edgao doing a minor version bump is totally fine. Once we push connectors to GA we'll also start doing honest-to-god semver.

* @return The path within the bucket that this writer will create. For example, if we wrote to "s3://yourBucket/some/path/to/file.csv", this method
* would return "some/path/to/file.csv".
*/
String getObjectPath();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would getOutputPath be accurate?

docs/integrations/getting-started/destination-redshift.md Outdated Show resolved Hide resolved
outputStreams.get(0).toString(StandardCharsets.UTF_8));
}

private static void checkObjectName(final String objectName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like we really just want to test the name generator, and that the name generator is being used. this is fine for now IMO, may be worth leaving this context you provided as a comment and saying output path generator should be probably be DI'd and we can just test the generator itself? in any case this is fine

edgao and others added 2 commits December 16, 2021 07:15
Co-authored-by: Sherif A. Nada <[email protected]>
Co-authored-by: Sherif A. Nada <[email protected]>
@edgao edgao temporarily deployed to more-secrets December 16, 2021 15:17 Inactive
@edgao
Copy link
Contributor Author

edgao commented Dec 16, 2021

/test connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1589440936
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1589440936
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    13      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     124      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 13      0   100%
	 normalization/transform_catalog/stream_processor.py                 494    313    37%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         146     32    78%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1199    501    58%

@edgao edgao temporarily deployed to more-secrets December 16, 2021 21:07 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets December 16, 2021 21:08 Inactive
@edgao
Copy link
Contributor Author

edgao commented Dec 16, 2021

/publish connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1589741011
❌ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1589741011

@edgao
Copy link
Contributor Author

edgao commented Dec 16, 2021

/publish connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/1589741281
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/1589741281

@jrhizor jrhizor temporarily deployed to more-secrets December 16, 2021 22:39 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets December 16, 2021 22:39 Inactive
@edgao
Copy link
Contributor Author

edgao commented Dec 16, 2021

/publish connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1589938384
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1589938384

@jrhizor jrhizor temporarily deployed to more-secrets December 16, 2021 23:46 Inactive
@edgao edgao temporarily deployed to more-secrets December 17, 2021 00:19 Inactive
@edgao edgao temporarily deployed to more-secrets December 17, 2021 00:38 Inactive
@edgao edgao merged commit 7038533 into master Dec 17, 2021
@edgao edgao deleted the edgao/s3_based_stream_copier branch December 17, 2021 00:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants