Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk Load CDK: Unwrap multipart streaming upload #48810

Merged

Conversation

johnny-schmidt
Copy link
Contributor

What

This is a first step toward separating processing from uploading

  • add ObjectStorageClient/S3Client support for uploading a part at a time (leaving the old style in place for now until it can be removed from all destinations/tests)
  • add a BufferingFormattedWriter that wraps the Avro/Json/Csv/Parquet writers, accumulating the records and yielding complete parts
  • change StreamLoader::processRecords so that it feeds the writer and uploads the parts as they're ready

Copy link

vercel bot commented Dec 4, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Dec 6, 2024 0:26am

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit connectors/destination/s3-v2 labels Dec 4, 2024
@johnny-schmidt johnny-schmidt marked this pull request as ready for review December 4, 2024 23:27
@johnny-schmidt johnny-schmidt requested a review from a team as a code owner December 4, 2024 23:27
private val streamProcessor: StreamProcessor<T>,
private val wrappingBuffer: T
) : ObjectStorageFormattingWriter {
override val numCapturedChanges: AtomicLong = writer.numCapturedChanges
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? I only see it used for logging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's some of what I added for testing, but if feels like useful information.


fun isDataSufficient(): Boolean {
return buffer.size() >= partSize
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this seems orthogonal to the buffered writer. Since it's called strictly externally, we can easily pull this out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it should just be a size check.

metadata: Map<String, String>
): StreamingUpload<S3Object> {
// TODO: Remove permit handling once we control concurrency with # of accumulators
if (uploadPermits != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we go ahead and move the semaphore upstream to get it out of the client code at least?

Copy link
Contributor

@tryangul tryangul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, pending nit change not withstanding

@johnny-schmidt johnny-schmidt force-pushed the jschmidt/load-cdk/separate-processing-from-upload branch from c44bb77 to c0dcc58 Compare December 5, 2024 00:56
@johnny-schmidt
Copy link
Contributor Author

I also ripped out the logging.

@johnny-schmidt johnny-schmidt force-pushed the jschmidt/load-cdk/separate-processing-from-upload branch from c0dcc58 to 5000a21 Compare December 5, 2024 02:09
@johnny-schmidt johnny-schmidt enabled auto-merge (squash) December 5, 2024 02:09
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/load-cdk/separate-processing-from-upload branch 4 times, most recently from a8f42b9 to 6b486f6 Compare December 5, 2024 23:48
@johnny-schmidt johnny-schmidt force-pushed the jschmidt/load-cdk/separate-processing-from-upload branch from 6b486f6 to 7f0a7ae Compare December 6, 2024 00:25
@johnny-schmidt johnny-schmidt merged commit 8beb1d1 into master Dec 6, 2024
33 checks passed
@johnny-schmidt johnny-schmidt deleted the jschmidt/load-cdk/separate-processing-from-upload branch December 6, 2024 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues CDK Connector Development Kit connectors/destination/s3-v2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants